At YardWork we implement AI agent systems for other businesses. To show that agent workflows produce real output — not just demos — we built and operate a fully automated content pipeline that has published 64 bilingual blog post pairs in English and German without a writer, editor, or content manager on staff.
The situation
Because agent systems are what we sell, our own content operation has to demonstrate the model rather than just describe it. A service business that publishes sporadically is implicitly arguing against its own pitch.
Publishing content was the obvious proof point — and the obvious constraint. But the goal was not just to rank on Google. The actual target was AI assistants: when a founder asks Claude or ChatGPT "which companies implement AI agent systems for small businesses," we want to be in the answer. That channel — generative engine optimisation, or GEO — works differently from traditional SEO. AI assistants pull individual sentences, not pages. They cite sources that have clear entities, explicit claims, and sentences that hold their meaning in isolation. Content that is not written for extraction does not get cited.
Writing a well-researched post in English takes the better part of a day. A bilingual version — researched, written, then adapted into a different register for a German business audience — roughly doubles that. At that pace we could manage three or four post pairs a month while running a business — not enough to build any real presence. We needed a pipeline that could produce at scale and was designed from the start to be cited, not just ranked.
The research layer
The pipeline starts before any writing happens. A content research API queries nine sources per topic seed: Google autocomplete, People Also Ask, Google Trends, SERP results, Reddit, Hacker News, Wikipedia, the Google Knowledge Graph, and an LLM probe.
For Wave 4 of our content plan, we ran 35+ seeds through the system in a single session. What came back was a full strategy document: six clusters organised by signal strength, execution order across clusters, near-duplicate warnings to avoid cannibalising existing posts, and a short list of seeds to cut because the signal wasn't there. Topic selection stopped being intuitive and became evidence-based.
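The near-duplicate warning is the easiest part of that strategy document to illustrate. The sketch below is one simple way to produce such a warning, assuming plain string similarity between a seed and the titles of published posts. The function name `near_duplicates` and the 0.8 threshold are hypothetical; the content research API's actual method is not described here.

```python
from difflib import SequenceMatcher


def near_duplicates(seeds, existing_titles, threshold=0.8):
    """Pair each seed with any published title it closely resembles.

    threshold is an illustrative cut-off, not a value from the pipeline.
    """
    warnings = []
    for seed in seeds:
        for title in existing_titles:
            # ratio() is 1.0 for identical strings and falls towards 0.0
            # as the two strings share fewer characters in order.
            ratio = SequenceMatcher(None, seed.lower(), title.lower()).ratio()
            if ratio >= threshold:
                warnings.append((seed, title, round(ratio, 2)))
    return warnings
```

A seed that appears in the result is a candidate for cutting or for redirecting to a different angle at the first approval gate.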
Two of the nine sources are specifically for GEO. The Knowledge Graph signal tells us whether a topic has an established entity — if Google has defined it, AI assistants are more likely to have been trained on it and to cite it confidently. The LLM probe assesses whether a topic is the kind of question AI assistants actually answer, and whether existing content covers it well enough that a new post has a realistic chance of being cited over what already ranks.
We review and approve the strategy. The research, the scoring, and the cluster prioritisation are produced by the system.
The strategy is generated by the agent, not for the agent. We approve direction — not the analysis.
The production layer
Each post runs through a five-phase skill built in Claude Code. What distinguishes it from a writing tool is the sequence of forcing functions before any prose exists.
The brief comes before any writing, and it has a rejection condition. The skill answers six questions: who specifically the post is for, the single takeaway, the counterintuitive argument, a Pullquote candidate, a Callout candidate, and the strongest concrete claim. The argument question has a test: would a reader who just saw the title already know this? If yes, the brief gets reworked — it means the post has a description, not an argument. Weak posts get caught here rather than after four hours of writing.
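As a rough illustration, the brief and its rejection condition could be modelled like this. The field names and the `review_brief` helper are hypothetical, and the answer to the title test is assumed to come from the model or a reviewer rather than being computed.

```python
from dataclasses import dataclass, fields


@dataclass
class Brief:
    # The six questions named in the text; field names are illustrative.
    audience: str
    takeaway: str
    argument: str
    pullquote: str
    callout: str
    strongest_claim: str
    # Answer to "would a reader who just saw the title already know this?"
    # Assumed to be judged upstream, not derived in this sketch.
    obvious_from_title: bool


def review_brief(brief: Brief) -> list[str]:
    """Return the reasons to rework the brief; an empty list means proceed."""
    problems = []
    for f in fields(brief):
        value = getattr(brief, f.name)
        if isinstance(value, str) and not value.strip():
            problems.append(f"missing answer: {f.name}")
    if brief.obvious_from_title:
        problems.append("argument is a description: the title already gives it away")
    return problems
```

Returning a list of reasons rather than a single pass/fail flag keeps the rework loop concrete: the brief goes back with the specific question that failed.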
Writing runs against a section map. Every section is defined before it is written: a heading that names its subject and a one-sentence job description. If a section's job cannot be named, the section does not exist yet. The intro is written last, after the full article exists — because the intro's job is to frame what the reader is about to get, and that cannot be known until the article is finished.
The quality gate runs eight separate sweeps: Clarity, Voice, So what, Prove it, Specificity, Heightened emotion, Zero risk, and AI extractability. Each sweep is a full re-read with a single question. The sweeps run sequentially because combining them produces a compromise pass that catches nothing well. The eighth sweep is the GEO pass: it reads every sentence in isolation and asks whether the claim holds without the surrounding paragraph. AI assistants do not cite pages; they cite sentences. A sentence that opens with "This means..." or names its subject as "it" or "they" is ambiguous when extracted. The sweep rewrites every failing sentence until the sentence names its subject explicitly, preserves its condition, and makes its claim complete without context.
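The detection half of the extractability sweep can be approximated with a simple heuristic. This is a minimal sketch, assuming that an ambiguous sentence is one that opens with a pronoun or demonstrative. The opener list and function names are illustrative, and the rewriting step, which is done by the agent, is not shown.

```python
import re

# Openers that lose their referent once a sentence is quoted alone.
# The article names "This means...", "it" and "they"; the rest is assumed.
AMBIGUOUS_OPENERS = re.compile(r"^(this|that|these|those|it|they)\b", re.IGNORECASE)


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def flag_for_rewrite(text):
    """Return sentences whose subject is unclear without the paragraph."""
    return [s for s in split_sentences(text) if AMBIGUOUS_OPENERS.match(s)]
```

A heuristic like this only catches the opening word. A pronoun in the middle of a sentence, or a dropped condition, still needs the full re-read the sweep describes.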
The German version runs in the same session. Formal Sie throughout. Tool names stay in English. Frontmatter is localised — read time format, CTA label. A formal Sie scan runs before the file is saved, checking for du, dein, dich, and a dozen other informal forms. Our first batch without this check produced three posts with informal pronouns. The scan was added immediately.
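A scan of this kind can be as small as one regular expression. The sketch below is an assumption about how such a check might look, not the pipeline's actual code. The text names du, dein, and dich; the other forms in the pattern are a guess at the "dozen other informal forms". The pronoun "ihr" is deliberately left out, because it collides with the formal possessive "Ihr".

```python
import re

# Informal second-person forms, matched as whole words, case-insensitive.
# Covers du, dich, dir, dein/deine/deinem/..., euch, euer, eure/eurem/...
INFORMAL = re.compile(
    r"\b(du|dich|dir|dein(?:e[mnrs]?)?|euch|euer|eure[mnrs]?)\b",
    re.IGNORECASE,
)


def informal_hits(text):
    """Return (line number, word) for every informal form found."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in INFORMAL.finditer(line):
            hits.append((number, match.group(0)))
    return hits
```

An empty result lets the file be saved; any hit sends the German draft back for correction before it is written to disk.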
The brief gets rejected before prose begins — not after four hours of writing.
What changed
Since the pipeline went live, after an initial build of one to two weeks, it has changed five things:
- Volume that wasn't possible before. 64 bilingual post pairs published. Orbit Media's annual survey of 1,000+ bloggers puts the average at 3 hours 48 minutes per post. A bilingual pair at that rate is 7–8 hours of writing work. 64 pairs is north of 450 hours: eleven to twelve weeks of full-time writing, running in the background while we run the business.
- Consistent quality regardless of capacity. Every post passes the same eight-sweep gate before it ships. The quality of any given post no longer depends on how much time was available that week.
- Built for AI citation, not just search ranking. Every post is written to pass the AI extractability sweep — sentences that name their subject, preserve their condition, and hold their meaning in isolation. The aim is to be in the answer when someone asks Claude or ChatGPT about AI agents for small businesses, not just to rank on page one.
- Research happens before writing, not instead of it. Topic selection used to be intuitive. Now every topic has a signal tier — PAA count, trend average, SERP competition, HN volume — before a brief is written. Wave 4's entire 38-post plan was validated in a single research session.
- Content work stopped competing with client work. The hours that would have gone into writing are available for implementation instead.
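The volume estimate in the first bullet can be checked with a few lines of arithmetic. The sketch assumes a German post costs the same writing time as an English one and that full-time means a 40-hour week; neither figure comes from the survey.

```python
from datetime import timedelta

per_post = timedelta(hours=3, minutes=48)  # survey average cited in the text
per_pair = 2 * per_post                    # one English post plus one German post
pairs = 64

total_hours = pairs * per_pair.total_seconds() / 3600
weeks = total_hours / 40                   # assumption: 40-hour working week

print(per_pair)               # 7:36:00
print(round(total_hours, 1))  # 486.4
print(round(weeks, 1))        # 12.2
```

At exactly twice the survey average the total is about 486 hours, or roughly twelve working weeks. Rounding a pair down to 7 hours gives 448 hours and about eleven weeks.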
What this took
The initial build took one to two weeks. The pipeline has been running and improving since — new sweeps added, signals refined, edge cases fixed as they appear.
The research layer came first: integrating the content API, running seed batches, and learning which signal combinations predicted good topics versus noisy ones. We drafted the first cluster plan manually, then rebuilt it using the API output — the difference in signal quality made the manual approach unacceptable by comparison.
The production skill came next. The brief framework went through several iterations before the argument-sharpness test was tight enough to reliably catch weak posts before writing. Early drafts passed the brief but failed the So what sweep — which meant the brief questions needed to be harder, not the writing phase longer.
The eighth sweep — AI extractability — was added after we noticed that well-written sentences still collapsed when pulled out of context. A sentence that opened with "This means..." or "It handles..." parsed fine in the paragraph but was useless as a standalone citation. We added the sweep to the gate and applied it retroactively to earlier posts.
The formal Sie scan for German was added after our first batch shipped with informal pronouns. The fix took ten minutes. The scan has caught eleven instances across subsequent batches.
Where we stay in the loop
Two explicit approval gates exist in the workflow.
The first is after the research layer: we review the cluster strategy and execution order before any posts are commissioned. Topics can be cut, reordered, or redirected to a different angle.
The second is after the brief: the six-question brief and section map are shown before writing starts. The brief must be confirmed — not just reviewed. If the argument is not right, it gets reworked before the writing phase begins.
Everything between those gates runs without interruption.
What the system does not do
The agent does not decide what we should be known for. Brand positioning, the choice of which verticals to target, and the decision to write about agent systems for SMBs rather than enterprise — those are our decisions, made before the pipeline starts.
The system does not publish autonomously either. The publish checklist is a human step, separate from the two approval gates: every post is reviewed before it goes live.
Where we are now
64 post pairs are live — 128 MDX files in English and German. Wave 4 is in progress: 38 posts across six topic clusters, all demand-validated, all waiting on the pipeline.
The same pipeline that runs our own content operation is the one we build for clients.