Why One Big AI Skill Is Slower Than Many Small Ones
Scope note: This post has one specific focus — comparing discrete small skills run sequentially against a single large skill that performs the same steps, even if that large skill internally delegates to the smaller ones. Everything else — orchestration overhead, output quality, model behaviour differences, cost-per-token — is treated as a constant and left out of scope. To study one variable, you have to hold the others still.
The Experiment
I had a workflow that involved several distinct steps. Think of it as something a lot of teams deal with: a user provides a brief — a topic, some notes, maybe a link — and the system needs to go off and produce a finished CMS page.
The steps looked roughly like this:
- Fetch the requirements from a ticket tracker
- Pull a design preview (an image) from a design tool
- Generate and create the page in the CMS
- Verify the page looks correct
- Add relevant tags and analytics
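In code, the shape of that workflow looks something like this: a minimal Python sketch with hypothetical function names and stubbed-out integrations, where each step is a small, self-contained unit with a defined input and output.

```python
# Illustrative sketch only: function names and return values are hypothetical,
# and the real integrations (ticket tracker, design tool, CMS) are stubbed out.

def fetch_requirements(ticket_id: str) -> dict:
    """Pull the brief from the ticket tracker (stub)."""
    return {"topic": "launch page", "notes": "...", "ticket": ticket_id}

def fetch_design_preview(requirements: dict) -> str:
    """Pull a design preview image reference from the design tool (stub)."""
    return f"preview-for-{requirements['ticket']}.png"

def create_cms_page(requirements: dict, preview: str) -> str:
    """Generate and create the page in the CMS; returns a page id (stub)."""
    return "page-123"

def verify_page(page_id: str) -> bool:
    """Check the page renders correctly (stub)."""
    return True

def add_tags_and_analytics(page_id: str) -> None:
    """Attach tags and analytics to the published page (stub)."""

# Each step starts from a clean slate: it receives only what it needs.
reqs = fetch_requirements("TICKET-42")
preview = fetch_design_preview(reqs)
page_id = create_cms_page(reqs, preview)
assert verify_page(page_id)
add_tags_and_analytics(page_id)
```

Note that the hand-offs are explicit: each function's parameters spell out exactly what the next step depends on.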
Each step was built as its own Claude Code skill — a focused, short prompt that knew exactly what it was doing. Running them one after the other interactively took 10 to 15 minutes from start to finish.
Then I did what felt like the obvious next step: I combined them into a single “master skill” — one prompt to rule them all. It called each sub-step in sequence. Same logic, same tools, same model.
It took up to an hour.
Why Does a Single Big Skill Take So Much Longer?
The answer comes down to what you are asking the LLM to do, and how much it has to carry in its head while doing it.
1. The LLM Becomes Both Planner and Executor
When you run five small skills manually, you — the human — are the orchestrator. You decide when step one is done, hand the result to step two, and so on. Each skill starts with a clean slate and a focused job.
When you wrap everything in one big skill, the LLM takes over that orchestration role. It now has to:
- Keep track of what has been done
- Decide what to do next
- Carry all prior context as it proceeds
- Recover gracefully if something earlier was ambiguous
That is a fundamentally harder job, and its performance degrades as the workflow grows.
2. Context Window Pollution
Every skill step produces output — text, structured data, confirmation messages. In a chained single-skill workflow, all of that accumulates in the same context window.
By the time the LLM reaches step four, it is wading through the outputs of steps one, two, and three before it can get to work. Worse, if step two produced a large chunk of data (a design image description, a long page draft), that bloats the context for every subsequent step.
The practical effect: the model gets slower, more hesitant, and more prone to re-reasoning things it already decided.
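A toy calculation makes the accumulation concrete. The token counts below are invented for illustration; only the shape of the growth matters.

```python
# Toy illustration of context growth in a chained single skill.
# Token counts are invented for illustration, not measured.
BASE = 800                             # mega-skill prompt plus original inputs
STEP_OUTPUTS = [1200, 3500, 900, 600]  # e.g. step two emits a large design description

context_at_step = []
context = BASE
for output in STEP_OUTPUTS:
    context_at_step.append(context)  # what the model wades through at this step
    context += output                # the step's output stays in the window

print(context_at_step)  # [800, 2000, 5500, 6400]
```

By step four the model is re-reading 6,400 tokens to do a job that needed 800.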
3. Error Recovery Becomes Expensive
In a chain of small skills, a failure in step three is isolated. You fix it and re-run from step three. In a single mega-skill, the LLM has to decide what went wrong, backtrack internally, and try again — all within the same context, all while remembering the full state of everything else.
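When orchestration lives outside the model, step-level recovery is a few lines of ordinary code. The sketch below is illustrative, with hypothetical names: it checkpoints each step's result, simulates a failure in step three on the first run, and shows that only step three re-runs.

```python
# Illustrative sketch: step-level checkpointing so a failure in step k
# restarts at step k, not at step 1. All names are hypothetical.

def run_with_checkpoints(steps, state, checkpoints):
    """Run steps in order, skipping any whose result is already checkpointed."""
    for name, fn in steps:
        if name in checkpoints:          # already done: reuse the saved result
            state[name] = checkpoints[name]
            continue
        state[name] = fn(state)          # fresh, bounded context per step
        checkpoints[name] = state[name]  # persist for cheap re-runs
    return state

# Simulate a failure in step three on the first run:
calls = []
def step(n, fail_once=False):
    def fn(state):
        calls.append(n)
        if fail_once and calls.count(n) == 1:
            raise RuntimeError(f"step {n} failed")
        return f"output-{n}"
    return fn

steps = [("one", step(1)), ("two", step(2)), ("three", step(3, fail_once=True))]
checkpoints = {}
try:
    run_with_checkpoints(steps, {}, checkpoints)
except RuntimeError:
    pass  # steps one and two are checkpointed; nothing is lost
final = run_with_checkpoints(steps, {}, checkpoints)  # only step three re-runs
```

The mega-skill has no equivalent of `checkpoints`: its only record of steps one and two is the text already sitting in its context window.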
The Math of It
Let’s say each step has a base context of \(C\) tokens when run in isolation — the skill prompt plus the inputs it needs. In the single-skill approach, by the time the LLM reaches step \(k\), the effective context has grown to:
\[C_{k} = C + \sum_{i=1}^{k-1} O_i\]

where \(O_i\) is the output size of step \(i\). The base context \(C\) stays fixed (it is the same mega-skill prompt throughout), but every prior step’s output accumulates on top of it. The model’s reasoning cost grows with each step because inference time scales super-linearly with context length for transformer architectures.
In the distributed skill approach, each of the \(n\) steps runs with a context close to \(C\), independent of which step it is. Total wall-clock time is roughly:
\[T_{\text{small}} \approx n \cdot t(C)\]

versus:
\[T_{\text{large}} \approx \sum_{k=1}^{n} t\!\left(C + \sum_{i=1}^{k-1} O_i\right)\]

Since \(t(x)\) is superlinear and every term in \(T_{\text{large}}\) is at least as large as \(t(C)\) — and grows larger with each step — the large-skill path loses badly as \(n\) increases.
The chart below plots cumulative execution time across a 5-step workflow, assuming \(t(x) \propto x^2\) (a reasonable model for transformer attention cost) and each step’s output roughly equalling the base context \(C\):
*Chart: Cumulative Execution Time, Small Skills vs Single Mega-Skill*
The red line (theoretical O(n²)) reaches 55 normalised units by step 5 versus 5 for small skills — an 11× gap on paper.
The real-world observation, however, was a 4–6× gap (10–15 minutes versus up to an hour). The theoretical model overshoots. Why?
Modern LLM inference providers do not expose raw quadratic attention to every request. KV caching means that once a token has been processed, its key-value pairs are stored and reused — so per-token generation cost scales closer to O(n) rather than O(n²). FlashAttention [2] reduces memory bandwidth pressure further. Batching, speculative decoding, and other infrastructure-level optimisations compress the wall-clock time even more.
The orange line models this: with O(n) cost per step, the mega-skill accumulates 15 units by step 5 — a 3× gap. The observed 4–6× sits between the two curves, which is exactly what you would expect from an infrastructure that partially — but not fully — absorbs the quadratic penalty.
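Both curves are easy to reproduce. The snippet below assumes a base context of 1 normalised unit, each step's output equal to the base context (so the mega-skill's context at step \(k\) is \(k\) units), and a per-step cost of \(t(x) = x^p\):

```python
# Reproducing the chart's numbers. Assumptions: base context C = 1 normalised
# unit, each step's output O_i = C, so the mega-skill's context at step k is k.

def mega_skill_time(n, p):
    """Cumulative cost over n steps when per-step cost is t(x) = x**p."""
    return sum(k ** p for k in range(1, n + 1))

n = 5
small_skills = n * 1                    # n isolated runs, each at t(1) = 1
mega_quadratic = mega_skill_time(n, 2)  # raw O(n^2) attention, p = 2
mega_linear = mega_skill_time(n, 1)     # KV-cached, roughly O(n), p = 1

print(small_skills, mega_quadratic, mega_linear)  # 5 55 15
```

That gives the 11× gap on paper (55 vs 5) and the 3× floor (15 vs 5) that bracket the observed 4–6×.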
The optimisations reduce the cost. They do not eliminate it. A 4–6× real-world gap on a five-step workflow is not a rounding error. It is the difference between a tool you reach for and a process you schedule.
Small Skills Win Because Boundaries Force Clarity
There is another reason beyond pure performance: constraint produces clarity.
A small skill is forced to have a defined input and a defined output. The model cannot hedge. It cannot defer a decision to a later step. It must produce something usable and hand it over.
That handover boundary is where a lot of hidden value lives. It forces the workflow designer to think clearly about:
- What does this step need to know?
- What should it produce?
- Who decides whether it succeeded?
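A handover contract can be as lightweight as a pair of typed shapes per step. This is an illustrative Python sketch; the names and fields are hypothetical, not taken from the workflow above.

```python
# Hypothetical handover contract for one step, sketched with TypedDicts.
from typing import TypedDict

class DesignPreviewInput(TypedDict):
    ticket_id: str        # what this step needs to know
    design_file_url: str

class DesignPreviewOutput(TypedDict):
    image_path: str       # what it must produce
    ok: bool              # success is decided at the boundary, not deferred

def fetch_design_preview(inp: DesignPreviewInput) -> DesignPreviewOutput:
    # ... call the design tool here (stubbed for illustration) ...
    return {"image_path": f"/previews/{inp['ticket_id']}.png", "ok": True}

out = fetch_design_preview(
    {"ticket_id": "TICKET-42", "design_file_url": "https://example.test/d/1"}
)
assert out["ok"]  # the contract, not the model, defines "done"
```

The contract is what the next step consumes; everything else the model produced along the way stays behind the boundary.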
When you collapse all of this into one skill, those boundaries dissolve. The LLM starts improvising the handover logic, which leads to inconsistency, unexpected paths, and the slow, sprawling runs that consume an hour of compute.
The Deeper Problem: LLMs Are Poor Orchestrators
LLMs are excellent at focused reasoning, code generation, summarisation, and decision-making within a bounded scope. They are poor at:
- Long-horizon state management
- Reliable multi-step self-orchestration
- Consistent recovery from partial failure
Asking a single LLM context to manage a five-step workflow is like asking a surgeon to also manage the operating theatre schedule, order supplies, and file insurance claims — all while performing the surgery. The expertise is there for the surgery. Everything else degrades the surgery.
The better model is this: the workflow decides what runs next. The LLM executes the step it is handed.
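That model fits in a few lines. The sketch below is a deliberately tiny, hypothetical engine: the graph's explicit edges decide sequencing, and each handler (a stand-in for an LLM skill call) sees only the payload it is handed.

```python
# Minimal sketch of "the workflow decides, the model executes".
# Graph shape and handler names are illustrative; handlers stand in
# for LLM skill invocations.

def run_workflow(graph, start, payload, handlers):
    """Follow explicit 'next' edges; no step manages long-horizon state."""
    node, log = start, []
    while node is not None:
        payload = handlers[node](payload)  # bounded context: just the payload
        log.append(node)
        node = graph[node]                 # the ENGINE picks the next step
    return payload, log

graph = {"fetch": "draft", "draft": "verify", "verify": None}
handlers = {
    "fetch":  lambda p: {**p, "brief": f"brief for {p['topic']}"},
    "draft":  lambda p: {**p, "page": p["brief"].upper()},
    "verify": lambda p: {**p, "ok": "BRIEF" in p["page"]},
}
result, order = run_workflow(graph, "fetch", {"topic": "launch"}, handlers)
```

Swapping a handler, reordering edges, or inserting a human-review node changes the graph, not the skills.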
This Is Exactly What FlowDrop Is Built For
Disclosure: I am the founder of FlowDrop. Make of that what you will — but the observation that prompted this post came before the product, not the other way around.
This is not an abstract principle. It is the concrete motivation behind FlowDrop.
FlowDrop is a visual workflow editor where you define the graph, and the workflow engine is what decides sequencing, branching, and handover. Each node in the graph can invoke an AI skill, a script, an API call, or a human review step. The LLM (or whatever sits inside each node) receives only what it needs, produces its output, and the engine moves the state forward.
In this model:
- Each step has a bounded, predictable context
- Failure in one step is isolated and restartable
- The workflow is inspectable, auditable, and reconfigurable without touching the skills themselves
There is one more cost the model above does not capture: the human in the loop. When skills are chained manually, someone has to watch for each step to finish, decide it looks good, and trigger the next one. That gap — however small per step — adds up across a workflow and introduces variability that no amount of prompt optimisation can fix.
FlowDrop closes that gap. The engine handles the handover. Skills run as fast as they can, in the right order, without waiting on anyone. The human stays in the loop where it matters — reviewing outcomes, not babysitting transitions.
The result is not just faster execution. It is more predictable execution, with human attention spent where it actually adds value.
What This Means in Practice
If you are building AI-assisted workflows with tools like Claude Code skills, n8n, LangGraph, or anything similar, the structural lesson is the same:
- Compose small, focused units rather than building one large prompt that tries to do everything
- Keep the orchestration logic outside the LLM — in a workflow engine, a script, or your own code
- Define clear handover contracts between steps: what goes in, what comes out, what success looks like
- Treat the LLM context window as a scarce, expensive resource — do not pollute it with context that does not serve the current step
The instinct to “just put it all in one prompt” is understandable. It feels simpler. But the performance, reliability, and maintainability penalties compound quickly.
Small and sharp beats big and blunt. Every time.
References
1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention Is All You Need. arXiv:1706.03762. — Establishes the O(n²) time and memory complexity of self-attention over a sequence of length n, the basis for the superlinear cost model used in this post. https://arxiv.org/abs/1706.03762
2. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. arXiv:2205.14135. — Addresses the quadratic memory and compute bottleneck of standard attention; confirms that context length growth is the dominant cost driver in transformer inference. https://arxiv.org/abs/2205.14135
3. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172. — Empirically shows that model answer quality degrades when relevant information is buried in the middle of a long context (as opposed to near the start or end). This is a quality effect rather than a speed effect, and is noted here as a complementary cost of context window growth — separate from, but compounding, the latency penalty this post focuses on. https://arxiv.org/abs/2307.03172
4. Schluntz, E., & Zhang, B. (2024, December 19). Building effective agents. Anthropic. — Anthropic’s practical guide to agent design patterns, including the principle of keeping orchestration logic outside the model. https://www.anthropic.com/research/building-effective-agents