Script First, AI Second: How We Run 20 Agents for $1 a Day

AgencyBoxx Team

Our first attempt at building AI agents cost us $150 in four hours. That disaster led us to discover the AI agent cost optimization strategy that now powers our entire operation.

That is not a typo. One hundred and fifty dollars. In an afternoon. On a system that was supposed to save us money.

Key takeaway: Scripted logic handles 80% of agent work at zero AI cost, while tiered model routing directs the remaining 20% to the cheapest model that produces acceptable quality, cutting daily operating costs from triple digits to roughly one dollar.

The mistake was obvious in hindsight. We had handed the AI model every decision. Every API call, every data formatting step, every routing choice, every classification task. The model was brilliant at all of it. It was also wildly expensive at all of it, because we were paying premium token rates for work that a Python script could do for free.

That afternoon changed how we build everything. Twelve months later, we run 20 agents across three OpenClaw instances for about $1 a day in AI credits. Same quality of output. Same operational coverage. Ninety-nine percent cheaper.

Here is the architecture that got us there.

The Expensive Lesson

When you first start building AI agents, the temptation is to let the AI do everything. It feels like magic. You describe what you want in plain English, the model figures out how to do it, and the result appears. No coding. No flowcharts. No debugging. Just vibes and tokens.

The problem is that most of what an agent does on a daily basis is not reasoning. It is plumbing.

Checking whether a team member logged time today is not an AI problem. It is a database query. Formatting a Slack message with the results is not an AI problem. It is string concatenation. Sending a reminder at 3:30 PM is not an AI problem. It is a cron job. Pulling contact information from Hunter.io, validating it with ZeroBounce, and organizing it in a structured format is not an AI problem. It is a series of API calls with error handling.

When we let the AI model handle all of that, we were paying for reasoning on tasks that required zero reasoning. Every token spent on "look up this API endpoint and format the JSON response" was a token wasted on something a ten line Python function could do faster, cheaper, and more reliably.
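
To make the distinction concrete, here is a minimal sketch of that kind of scripted check, with hypothetical TIME_API_URL and SLACK_WEBHOOK_URL placeholders standing in for real endpoints and a hypothetical response shape. Run it from cron and no model is ever involved:

```python
# Pure plumbing: a query, string formatting, and a webhook call.
# TIME_API_URL and SLACK_WEBHOOK_URL are hypothetical placeholders.
import os
import requests

TIME_API_URL = os.environ["TIME_API_URL"]            # your time-tracking API
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]  # Slack incoming webhook

def check_time_entries(team: list[str]) -> None:
    """Find who has not logged time today and nudge them in Slack."""
    resp = requests.get(TIME_API_URL, params={"date": "today"}, timeout=10)
    resp.raise_for_status()
    # Assumes the API returns {"entries": [{"user": ...}, ...]}.
    logged = {entry["user"] for entry in resp.json()["entries"]}

    missing = [name for name in team if name not in logged]
    if not missing:
        return  # everyone logged time; nothing to send

    # String concatenation, not reasoning: zero tokens spent.
    message = "Missing time entries today: " + ", ".join(missing)
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
```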

According to Fortune Business Insights, the agentic AI market will reach $196.6 billion by 2034, growing at 43.8% CAGR. The demand is real, but so is the cost problem. Custom AI agent builds typically cost $75,000 to $300,000 according to industry pricing research, and much of that expense comes from over-engineering the AI layer when deterministic scripts would do the job. The $150 afternoon was the cost of learning that lesson. It was worth every penny because we never made that mistake again.

The Architecture: Scripts Handle the Predictable, AI Handles the Unpredictable

Every agent we build now follows the same pattern. We start in Claude Code and build out the core functionality as deterministic Python scripts. The logic is explicit. The behavior is predictable. There are no hallucinations because there is no model involved in the mechanical parts.

Then we bring AI in surgically, only for the parts that genuinely require intelligence.

Here is how that breaks down in practice:

Scripted (zero AI cost):

  • API calls to ClickUp, Front, Gmail, Google Drive, HubSpot, and every other integration
  • Data formatting, cleaning, and transformation
  • Routing logic (which agent handles which type of request)
  • Scheduling and cron jobs (morning briefings, end of day summaries, compliance checks)
  • SLA timer calculations and escalation triggers
  • Spam classification using rule based pattern matching (see the sketch after this list)
  • File organization and folder management
  • Health checks and service monitoring
  • Log aggregation and audit trail generation
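
As a flavor of the scripted side, here is a minimal sketch of the rule based spam classification item, with illustrative patterns standing in for production rules:

```python
# Rule based spam classification: deterministic, auditable, zero tokens.
# The patterns below are illustrative, not our production rules.
import re

SPAM_PATTERNS = [
    re.compile(r"(?i)\byou (have )?won\b"),
    re.compile(r"(?i)\bwire transfer\b.*\burgent\b"),
    re.compile(r"(?i)\bact now\b.*\blimited time\b"),
]

def is_spam(subject: str, body: str) -> bool:
    """Same input, same answer, every time."""
    text = f"{subject}\n{body}"
    return any(pattern.search(text) for pattern in SPAM_PATTERNS)

# Anything the rules cannot decide gets escalated to the AI tier instead.
```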

AI powered (costs tokens, but worth it):

  • Drafting email replies that match a specific person's writing style (see the sketch after this list)
  • Analyzing meeting transcripts for action items, sentiment, and risk signals
  • Generating briefings that synthesize information from multiple sources
  • Evaluating whether a prospect contact is a genuine decision maker
  • Diagnosing novel system failures that do not match known patterns
  • Content generation (blog drafts, summaries, reports)
  • Complex classification tasks where rules cannot capture the nuance
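
On the AI powered side, the model call itself stays small because the scripts do the prep work. A minimal sketch using the google-generativeai package, with an illustrative model name and prompt:

```python
# A surgical AI call: the script assembles a distilled brief, and the
# model only does the part that needs judgment. Model name is illustrative.
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-pro")

def draft_reply(brief: str, style_notes: str) -> str:
    """The one step that requires reasoning; everything else is scripted."""
    prompt = (
        "Draft a reply email.\n\n"
        f"Context brief:\n{brief}\n\n"
        f"Match this writing style:\n{style_notes}"
    )
    return model.generate_content(prompt).text
```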

The ratio in our system is roughly 80/20. Eighty percent of what our agents do every day is scripted. Twenty percent involves an AI model. That ratio is why the daily cost dropped from "this will bankrupt us" to about a dollar.

Tiered Models: Not All AI Tasks Are Equal

Even within the 20% that requires AI, not every task needs the same model.

We run a tiered routing strategy that matches the complexity of the task to the cost of the model:

Free (local models via Ollama): Summarizing text, cleaning scraped data, chunking documents for our knowledge base, basic classification, embedding generation. We run Qwen 2.5 and nomic-embed-text locally on the same Mac Studio that hosts everything else. These models handle thousands of operations a day at zero marginal cost. When Qwen 3.5 dropped recently, we found it performing at roughly the level of mid tier cloud models from six months ago. That is a significant step up for a model running on local hardware for free.

Cheap (Gemini Flash): Standard classification, email triage, template driven content fills, and any task where speed matters more than nuance. Costs as low as $0.30 per million input tokens. We use this tier for high volume, moderate complexity work that needs more intelligence than a local model but does not justify a premium model.

Mid tier (Gemini Pro): Email drafts, meeting briefings, coordination across agents, and tasks that require genuine reasoning but are not client facing. This is the workhorse tier for most of the AI powered 20%.

Premium (reserved for high stakes output): Client facing email drafts, complex diagnostic analysis, and anything where getting it wrong has real consequences. Used sparingly and only after a cheaper model has already compressed and organized the context.

Here is how those tiers compare side by side:

| Tier | Example Tasks | Cost per Million Tokens | When to Use |
| --- | --- | --- | --- |
| Free Local (Ollama) | Summarization, embedding, basic classification, data cleaning | $0.00 | High volume, low complexity tasks that run thousands of times daily |
| Cheap Cloud (Gemini Flash) | Email triage, template fills, standard classification | $0.30 | Moderate complexity where local models occasionally miss |
| Mid-Tier (Gemini Pro) | Email drafts, meeting briefings, agent coordination | $1.25 | Genuine reasoning tasks that are not client facing |
| Premium (reserved) | Client facing drafts, complex diagnostics, high stakes output | $2.00+ | Only after cheaper models have compressed the context |

"59% of employees will need additional training by 2030, but agencies can start automating operational overhead today." -- World Economic Forum, Future of Jobs Report

The core principle: cheap models gather and organize. Expensive models judge and create. This AI agent cost optimization philosophy is what makes the entire architecture sustainable. To understand why most AI agents do not need to be smart, only cheap, look at how this tiered approach plays out across every operational domain.
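
In code, the routing layer can be as simple as a lookup table. A minimal sketch, with illustrative task types and tier assignments rather than our production routing table:

```python
# Tiered routing: map each task type to the cheapest model that produces
# acceptable quality. Task names and assignments here are illustrative.
from enum import Enum

class Tier(Enum):
    FREE = "local model via Ollama"
    CHEAP = "Gemini Flash"
    MID = "Gemini Pro"
    PREMIUM = "premium model (reserved)"

ROUTES = {
    "summarize": Tier.FREE,
    "embed": Tier.FREE,
    "triage_email": Tier.CHEAP,
    "draft_internal": Tier.MID,
    "draft_client_facing": Tier.PREMIUM,
}

def route(task_type: str) -> Tier:
    # Default to the cheapest tier; only named task types escalate.
    return ROUTES.get(task_type, Tier.FREE)
```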

Context Distillation: The Hidden Cost Multiplier

The single biggest cost driver in AI agent systems is not which model you use. It is how much context you send it.

A typical agent task might involve: the current email thread (20 messages), the client's communication history (dozens of previous interactions), relevant meeting transcripts, the client's project status from ClickUp, and any previous corrections the human made to similar drafts. If you feed all of that raw to a premium model, you are paying for the model to read through pages of context before it even starts thinking about the actual task.

Our system distills context before it reaches the expensive model. A cheap model (or a script, when the extraction is mechanical) reads all of that raw input and produces a focused brief: who the email is from, what they want, what the relevant history is, what tone previous corrections suggest, and what constraints apply. The premium model only sees that brief.

This keeps our premium model costs 70 to 95% lower than they would be if we fed everything raw. The quality of the output does not suffer because the distillation step preserves everything the model actually needs to do its job. It just strips out the noise.
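
The pipeline shape is straightforward. A minimal sketch, assuming a hypothetical call_model helper that wraps the tiered clients described above:

```python
# Context distillation: a cheap model compresses the raw inputs into a
# focused brief, and only the brief reaches the premium model.
# call_model is a hypothetical helper wrapping your tiered model clients.

def call_model(tier: str, prompt: str) -> str:
    """Hypothetical wrapper around whichever model clients you run."""
    raise NotImplementedError("wire this to your cheap/premium clients")

def distill(raw_context: str) -> str:
    # The cheap tier reads everything and strips out the noise.
    instructions = (
        "Produce a focused brief for an email drafter: who is writing, "
        "what they want, relevant history, tone notes from previous "
        "corrections, and any constraints."
    )
    return call_model("cheap", f"{instructions}\n\n{raw_context}")

def draft_with_premium(raw_context: str, task: str) -> str:
    # The premium tier only ever sees the distilled brief, never the
    # raw thread, history, and transcripts.
    brief = distill(raw_context)
    return call_model("premium", f"{task}\n\nBrief:\n{brief}")
```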

If you are running agents and your token costs are climbing, look at your context sizes before you look at your model choices. Compressing input is almost always a bigger lever than downgrading models. For a deeper look at the economics, see why our 21 AI agents cost $2.50 a day.

The Alternative Architecture: AI Driven SOPs

Our approach is not the only way to do this.

We have spoken with another agency running a comparable number of agents on OpenClaw who took a fundamentally different path. Instead of scripting the deterministic logic, they built their agents around AI driven standard operating procedures. Each agent receives a detailed SOP: "this is how we handle time tracking compliance," "this is what happens after a meeting ends," "this is the process for triaging a new client email." The AI interprets and executes against those instructions, making more autonomous decisions within the boundaries of the SOP.

Their agents write daily self assessments. They document what worked, what failed, and what they would do differently. Over time, those assessments feed back into the system and the agents genuinely improve. One of their agent logs noted: "Attempted time entry creation with non existent properties without schema validation first. Should have tested single record before batch operation to catch property errors." That is a real learning loop.

The tradeoffs are clear:

The SOP approach gives you flexibility and self improvement. Agents can adapt to novel situations without new code. The self assessment loop means they get better over time. And the SOP format is accessible to non developers who can read and edit plain English instructions.

The scripted approach gives you predictability and cost control. Agents do exactly what you told them to do, every time. There are no surprise behaviors, no token spikes from an agent deciding to "think harder" about a routine task, and no risk of the model misinterpreting an SOP in a creative but wrong way.

Both approaches produce working agency operations systems. Both serve real clients. The right choice depends on your tolerance for unpredictability, your budget for AI credits, and whether you have someone on the team who can write Python.

We chose scripts because we wanted the lowest possible operating cost and the highest possible predictability. We are not against the SOP approach. We just learned the hard way that when you give AI models more decision surface, you pay for it in tokens and in the occasional surprise.

What This Means for Your Agency

If you are thinking about building AI agents for your agency, here are the practical takeaways:

Start with the scripts, not the AI. Before you involve any model, ask yourself: does this task actually require intelligence, or does it require execution? If a series of API calls and some conditional logic can handle it, script it. Save the AI for the parts where a human would need to think.

Tier your models like you tier your team. You would not assign a senior strategist to format a spreadsheet. Do not assign a premium AI model to summarize a meeting transcript. Match the cost of the model to the complexity of the task.

Compress before you spend. Every token you send to a premium model costs money. Distill your context first. A cheap model or a simple script can extract the relevant information and throw away the noise before the expensive model ever sees it.

Measure your actual costs from day one. We track token usage, cost per agent, cost per task, and cost per day across every instance. If you do not measure it, you cannot optimize it. The dashboard does not need to be fancy. A daily log that shows which agents consumed how many tokens is enough to spot the problems.
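
The tracking itself can be trivial. A minimal sketch of a per-call cost log, assuming you compute cost_usd from each provider's published token rates:

```python
# One CSV row per model call: enough to spot which agents burn tokens.
import csv
from datetime import date, datetime
from pathlib import Path

LOG = Path("token_costs.csv")

def log_usage(agent: str, model: str, tokens: int, cost_usd: float) -> None:
    """Append one row per model call; write the header on first use."""
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["timestamp", "agent", "model", "tokens", "cost_usd"])
        writer.writerow([datetime.now().isoformat(), agent, model, tokens, cost_usd])

def cost_today() -> float:
    """Sum today's spend across all agents."""
    today = date.today().isoformat()
    with LOG.open() as f:
        return sum(
            float(row["cost_usd"])
            for row in csv.DictReader(f)
            if row["timestamp"].startswith(today)
        )
```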

Do not be afraid of the hybrid. Some agents in our system are 95% scripted with a tiny AI component for the one step that needs reasoning. Others are more balanced. There is no rule that says every agent has to follow the same architecture. Match the approach to the task.

We put this architecture through rigorous validation: "First round of 80 tests: 97% success rate. Second round: 100%." That kind of reliability does not come from prompt engineering alone. It comes from scripting the predictable and reserving AI for the genuinely unpredictable.

The $150 afternoon felt like a disaster at the time. In retrospect, it was the most valuable four hours we spent on the entire project. It forced us to think about AI agents as an engineering problem, not a prompting problem. And that shift in thinking is the difference between a system that costs a fortune to run and one that costs a dollar a day.

To see how it works in practice, or to explore the full breakdown of our architecture, the best next step is a live walkthrough.

Frequently Asked Questions

What does "script first, AI second" mean?

It means building the deterministic, predictable parts of an AI agent as plain code (Python scripts, API calls, conditional logic, cron jobs) before introducing any language model. The AI layer only handles tasks that genuinely require reasoning, like drafting emails in a specific voice or analyzing meeting transcripts for sentiment. In our system, roughly 80% of what agents do daily is scripted with zero AI cost, and only 20% involves a model call.

How much does it cost to run 20 AI agents?

Our 20 agents across three OpenClaw instances cost approximately $1 a day in AI credits. The cost is low because the vast majority of operations are scripted (zero token cost), local models handle basic AI tasks for free, and tiered model routing ensures expensive models are only used when the output quality demands it. Context distillation further reduces premium model costs by 70 to 95%.

What is tiered model routing?

Tiered model routing matches the complexity of an AI task to the cost of the model that handles it. Free local models handle summarization and embedding. Cheap cloud models handle classification and triage. Mid tier models handle email drafts and coordination. Premium models are reserved for high stakes, client facing output. The principle is simple: cheap models gather and organize, expensive models judge and create.

Can you build AI agents without expensive models?

Yes. Our time tracking enforcement agent, which was the first agent we built and delivers the fastest ROI, uses zero AI. It is entirely scripted Python: API calls, conditional logic, and Slack notifications. Several other agents in our system are 90 to 95% scripted with only a small AI component for the one step that requires natural language understanding. Starting with scripts and adding AI only where necessary is the most cost effective approach.

AgencyBoxx runs 50+ services on dedicated hardware for a fraction of what most agencies spend on a single SaaS subscription. Book a Walkthrough to see the architecture in action.