What It Actually Costs to Run AI in Production

We measured a real AI support conversation: 4,363 tokens, $0.002. Here's the full cost breakdown across 7 models.

Hej Team

We run AI chat on hej.chat. Real conversations, real customers, real invoices. So we measured what a typical support interaction actually costs.

One question ("Do you ship to Germany?"), one knowledge base search, one answer. Total: 4,363 tokens. Cost: $0.0024. That's a quarter of a penny.

A human support agent handling the same ticket costs roughly $2. The AI is 830x cheaper. But the model you pick matters, and the token math is not what most people assume. Here's the full breakdown.

What a real conversation looks like

Most cost estimates assume a roughly even split between input and output tokens. That's wrong. Here's what our production data actually shows:

Token split from a real support conversation

98% input, 2% output. That's not a typo. Here's where those 4,270 input tokens come from:

  • System prompt: ~500 tokens (personality, instructions, response guidelines)
  • Tool definitions: ~200 tokens (the searchKnowledgeBase function schema)
  • Knowledge base results: ~3,400 tokens (crawled page content returned by the tool)
  • User message: ~10 tokens ("Do you ship to Germany?")

The output? A clean 93-token answer: shipping costs, delivery times, customs info. Done.

This ratio has a huge practical consequence: input pricing matters far more than output pricing for chat applications. With a split this lopsided, a model with cheap input tokens will beat a model with cheap output tokens almost every time.
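To make that concrete, here's the per-conversation arithmetic as a quick sketch, using our measured token counts and Gemini 3 Flash's rates from the pricing table further down ($0.50 input / $3.00 output per million tokens; rates change, so treat them as a snapshot):

```python
# Per-conversation arithmetic for the measured 4,270-in / 93-out split.
# Rates: Gemini 3 Flash, $0.50 input / $3.00 output per 1M tokens (snapshot).
INPUT_PRICE, OUTPUT_PRICE = 0.50, 3.00
input_tokens, output_tokens = 4_270, 93

input_cost = input_tokens / 1_000_000 * INPUT_PRICE
output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE

print(f"input:  ${input_cost:.6f}")   # $0.002135
print(f"output: ${output_cost:.6f}")  # $0.000279
# Output tokens cost 6x more per token, yet input still dominates the bill:
print(f"input share of total: {input_cost / (input_cost + output_cost):.0%}")  # 88%
```

Even though the output rate is six times the input rate, input tokens account for roughly 88% of the total. That's the lopsided split doing the work.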

The pricing landscape

Here are the seven models most relevant for production chat, sorted cheapest to most expensive:

API pricing comparison across seven models

| Model | Provider | Input / 1M tokens | Output / 1M tokens |
| --- | --- | --- | --- |
| Gemini 2.0 Flash-Lite | Google | $0.075 | $0.30 |
| GPT-4o mini | OpenAI | $0.15 | $0.60 |
| Gemini 2.5 Flash | Google | $0.30 | $2.50 |
| Gemini 3 Flash | Google | $0.50 | $3.00 |
| Claude Haiku 4.5 | Anthropic | $1.00 | $5.00 |
| Gemini 2.5 Pro | Google | $1.25 | $10.00 |
| GPT-4o | OpenAI | $2.50 | $10.00 |

Google's Gemini Flash lineup dominates the budget end. Flash-Lite 2.0 at 7.5 cents per million input tokens is essentially free. Even Gemini 3 Flash, which is significantly more capable, costs just $0.50 per million input tokens.

What 1,000 conversations cost

Using our real production data (4,270 input, 93 output per conversation):

Cost per 1,000 conversations across seven models

| Model | Per conversation | Per 1K conversations | Per 10K conversations |
| --- | --- | --- | --- |
| Gemini 2.0 Flash-Lite | $0.00035 | $0.35 | $3.50 |
| GPT-4o mini | $0.0007 | $0.70 | $7 |
| Gemini 2.5 Flash | $0.0015 | $1.50 | $15 |
| Gemini 3 Flash | $0.0024 | $2.40 | $24 |
| Claude Haiku 4.5 | $0.0047 | $4.70 | $47 |
| Gemini 2.5 Pro | $0.0063 | $6.30 | $63 |
| GPT-4o | $0.012 | $12 | $120 |
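Every row in the table reduces to one formula. A minimal sketch for a few of the models, with prices copied from the pricing table above (snapshot rates, not guarantees):

```python
# Reproduce the per-conversation figures from the measured token counts.
# Prices are $ per 1M tokens (input, output), copied from the pricing table.
PRICES = {
    "Gemini 2.0 Flash-Lite": (0.075, 0.30),
    "GPT-4o mini":           (0.15, 0.60),
    "Gemini 3 Flash":        (0.50, 3.00),
    "GPT-4o":                (2.50, 10.00),
}
INPUT_TOKENS, OUTPUT_TOKENS = 4_270, 93  # from the measured conversation

def cost_per_conversation(input_price: float, output_price: float) -> float:
    """Dollar cost of one conversation at the measured token counts."""
    return (INPUT_TOKENS * input_price + OUTPUT_TOKENS * output_price) / 1_000_000

for model, (inp, out) in PRICES.items():
    c = cost_per_conversation(inp, out)
    print(f"{model:24s} ${c:.5f}/conversation   ${c * 1000:.2f}/1K")
```

Swap in your own prompt, tool, and knowledge-base token counts to estimate your workload; the 98/2 input-heavy split is what makes the input price the dominant term.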

We use Gemini 3 Flash at hej.chat. At $2.40 per thousand conversations, it gives us the newest model quality from Google at a price that's basically invisible on our bill. The jump from Flash 2.5 to Flash 3 is less than a dollar per thousand conversations, but the response quality difference is noticeable.

Here's another way to think about it. What does $100 of API credits actually get you?

Number of conversations per $100 across seven models

Flash-Lite 2.0 gives you 285,000 conversations for $100. Even Gemini 3 Flash handles 41,000+ conversations for that price. GPT-4o, the most expensive model here, still delivers 8,300 conversations. These are not serious expenses for any business with a website.
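Those figures are just the inverse of per-conversation cost. A tiny helper (the function name is ours) makes the division explicit:

```python
# Conversations a fixed budget buys: budget divided by per-conversation cost.
def conversations_per_budget(budget_usd: float, cost_per_conv: float) -> int:
    return int(budget_usd / cost_per_conv)

print(conversations_per_budget(100, 0.00035))  # Flash-Lite 2.0: 285714
print(conversations_per_budget(100, 0.0024))   # Gemini 3 Flash: 41666
print(conversations_per_budget(100, 0.012))    # GPT-4o: 8333
```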

Scaling to 500K conversations

At low volume, model choice barely matters. At scale, it compounds.

Monthly cost by conversation volume for four representative models

| Volume | Flash-Lite 2.0 | Gemini 3 Flash | Haiku 4.5 | GPT-4o |
| --- | --- | --- | --- | --- |
| 10K/month | $3.50 | $24 | $47 | $120 |
| 50K/month | $17.50 | $120 | $235 | $600 |
| 100K/month | $35 | $240 | $470 | $1,200 |
| 500K/month | $175 | $1,200 | $2,350 | $6,000 |

At 100K conversations per month on Gemini 3 Flash, you're spending $240/month. For context, a single full-time support agent costs $3,000-5,000/month and handles maybe 2,000 conversations. The AI handling 100K conversations at $240 is doing the work of 50 agents at 0.15% of the cost.

Even at 500K monthly conversations (a high-traffic SaaS product), Flash-Lite keeps you under $200/month and Gemini 3 Flash costs $1,200. These are rounding errors for any company generating that kind of traffic.

A note on speed

With only 93 output tokens per response, model speed barely matters for this use case. Even the slowest model on our list generates 93 tokens in about 1.5 seconds. The fastest do it in under half a second. Both feel instant in a chat widget.

What does matter is time to first token (TTFT), the delay before the response starts streaming. Google's Flash models and GPT-4o mini lead here with sub-500ms TTFT. That means the user sees text appearing almost immediately after sending their message.

For a chat application, any model in the Flash or mini tier is fast enough that users won't notice a difference.

Which model should you pick?

| Use case | Recommended | Why |
| --- | --- | --- |
| High-volume FAQ bot | Gemini 2.0 Flash-Lite | Cheapest at any scale, handles simple Q&A well |
| General website support | Gemini 3 Flash | Best quality/cost balance, newest Google model |
| E-commerce product chat | Gemini 2.5 Flash or 3 Flash | Good with product details and tool use |
| Complex multi-turn support | Gemini 2.5 Pro or GPT-4o | Stronger reasoning for nuanced conversations |
| Enterprise / compliance-heavy | Claude Haiku 4.5 or Gemini 2.5 Pro | Anthropic's safety profile or Google's enterprise support |

For most businesses adding AI chat to their website, Gemini 3 Flash is the sweet spot. It's fast, cheap, capable, and handles tool use (like searching a knowledge base) reliably. That's why we use it at hej.chat.

If cost truly is a constraint, Flash-Lite 2.0 at $3.50 per 10K conversations is hard to argue with. If you need premium reasoning, Gemini 2.5 Pro or GPT-4o are there, and even they cost less than a single team lunch per thousand conversations.

The era where AI was expensive is over. A real support conversation costs a quarter of a penny. The question isn't whether you can afford to run AI. It's what's stopping you.