
Agents make pretty darn good beta testers

Using coding agents to pressure-test developer tools across hundreds of real-world repositories

May 11, 2026

I recently built a JSON schema validation CLI. There are plenty of schema validation engines out there, of course, but I wanted a particular sort of dev tool ergonomics layer that I couldn’t seem to find. Before the era of agentic engineering, it was often hard to justify building utilities of this sort, but now that they can be cranked out in a matter of hours, my agent and I got to work. Implementing the core logic, command line interface, and configuration structure didn’t take all that long. I added thorough unit test coverage and fixed the obvious bugs. At this point the tool was usable, but passing unit tests of course wasn’t enough — I wanted to know whether it actually worked well in real-world contexts. Historically, this might be the point where we’d bring in human beta testers. But now…

We’ve got a new option: Agents. Agents aren’t full replacements for actual human users, of course. But the underlying language models have plentiful real-world feedback in their training data, and they can serve, at least to a degree, as user simulators. So, I asked an agent to pick a public repository, clone it, run my tool against it, and see whether the tool output seemed correct and reasonable. Rough edges quickly became clear, and I fixed them. That was remarkably useful, I thought — so I did it with another repo, and another. Then I figured we could do a batch of ten repos. Actually, why not a hundred?!

Well, a hundred turned out to be pushing it. At that scale, operations started timing out and the agent had more issues to reason over at once than it could meaningfully handle. Nothing some additional infrastructure couldn’t fix, though. My perspective is that most codebases deserve dedicated, contextually designed tool servers, and this one was no exception. The trick with agentically orchestrating complex workflows is deciding what you want to leave flexible and what’s better made mechanical. In this case, there was no particular reason the agent needed to figure out again and again how to efficiently and reliably clone large numbers of public repos, nor did every clone need to finish before the agent started on the smaller repos that were ready first. So, once the agent chooses the repos it wants to test against, the tool server starts cloning them in parallel and provides just the first one to work with as soon as it’s ready. When the agent reports that it’s done with the first, the tool gives it the next one. In this way, the tool server becomes an analog of a graphical interface wizard: start here, make this decision, take this action next, etc.
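
That handout mechanism is simple enough to sketch. Here’s a minimal illustration of the idea in Python, assuming repos are fetched with a plain `git clone`; the `RepoQueue` class and `next_repo` method are hypothetical names, not the actual tool server:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from queue import Queue


class RepoQueue:
    """Clone repos in parallel, but hand them to the agent one at a time."""

    def __init__(self, repo_urls: list[str], workdir: str = "repos"):
        self.workdir = Path(workdir)
        self.workdir.mkdir(exist_ok=True)
        self.ready = Queue()  # finished clones, in completion order
        self.pool = ThreadPoolExecutor(max_workers=8)
        for url in repo_urls:
            self.pool.submit(self._clone, url)

    def _clone(self, url: str) -> None:
        dest = self.workdir / url.rstrip("/").split("/")[-1]
        subprocess.run(["git", "clone", "--depth", "1", url, str(dest)], check=True)
        self.ready.put(dest)  # available to the agent as soon as the clone lands

    def next_repo(self) -> Path:
        # The "give me the next repo" tool action: blocks until some clone has
        # finished, so small repos reach the agent while big ones are still cloning.
        return self.ready.get()
```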

Another issue agents run into at this sort of scale is forgetfulness — by which I mean context window compaction. Fortunately, data persistence is another thing that classic computer programs are great at. So, when the agent reports its product feedback for a repo, the tool persists this data to disk — which is useful not only for future agentic use, but also for human review. Then, at the end of the process, when all repos (tens! hundreds!) have been considered, there’s a tool to recall all the previous feedback, which the agent can consider in aggregate and synthesize prioritized recommendations from.
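
A sketch of that persistence layer, with hypothetical names for the two tool actions (record feedback per repo, recall everything at the end):

```python
import json
from pathlib import Path

FEEDBACK_DIR = Path("feedback")


def record_feedback(repo: str, feedback: dict) -> None:
    # Called when the agent reports its findings for one repo; each repo gets
    # its own file, so a compacted or interrupted session loses nothing.
    FEEDBACK_DIR.mkdir(exist_ok=True)
    (FEEDBACK_DIR / f"{repo}.json").write_text(json.dumps(feedback, indent=2))


def recall_feedback() -> list[dict]:
    # Called once at the end: returns every saved report so the agent can
    # consider the aggregate and synthesize prioritized recommendations.
    return [json.loads(p.read_text()) for p in sorted(FEEDBACK_DIR.glob("*.json"))]
```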

I kept improving the toolset (sketched below), which eventually grew to include actions for discovering candidate repos, checking preparation status, querying tool output, and more. With this tooling at its disposal, my agent cranked through countless real-world repositories, leading to a laundry list of product improvements:

- Flexible JSON parsing, including comments and trailing commas
- Parse-error hints for templated YAML
- Support for vscode: schema references
- UTF-8 BOM handling
- Summarization of oversized enums
- Corrections to SchemaStore catalog matching
- Hints for missing local schemas, such as dependency-provided schemas under node_modules
- …and more
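
For a rough picture of that toolset, here are hypothetical stubs for the additional actions mentioned above, alongside `next_repo`, `record_feedback`, and `recall_feedback` from the earlier sketches; the names and signatures are illustrative, not the real interface:

```python
def discover_repos(query: str, limit: int = 50) -> list[str]:
    """Return candidate public repo URLs worth testing against (e.g., ones rich in JSON/YAML config)."""
    raise NotImplementedError


def preparation_status() -> dict[str, str]:
    """Map each chosen repo to its clone status: 'ready', 'cloning', or 'failed'."""
    raise NotImplementedError


def query_output(repo: str) -> str:
    """Return the CLI output recorded for a repo, so the agent can re-inspect it later."""
    raise NotImplementedError
```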

So far I’d been running all of this locally, but product improvement is never done, so I really wanted to run it repeatedly, and ideally not on my dev box. That’s where an agent host like Agentic Workflows comes in. Since I won’t be there to keep an eye on things, I want strong guardrails and boundaries on execution: control over network access, restrictions on shell use, and limits on runtime. And since I’m not watching the testing run, I also need someplace to see the results. There are plenty of options here, but I decided to open a pull request with the persisted run results, plus a discussion with the product recommendations. While I’ve agreed with the vast majority of suggestions from the process so far, I very much want the ability to review, consider, and tweak the changes before making them, especially as the product issues become more subtle. If I agree with the recommendations, I can assign them as follow-up work items.

It might seem odd at first to employ coding agents as product testers — after all, they’re not really getting value from the product the way actual end users would. But in a sense, agentic product testing is a natural extension of agentic code review: maybe the agent spots a bug through static analysis (looking at the code), or maybe it spots it through dynamic analysis (actually running the system), or perhaps even both! As long as the issues are found (and fixed), I’m not sure it matters terribly much. I don’t think we’re anywhere close to having discovered all the workflows that can be agentically automated, and I’m sure robotic beta testing is just one of many still to explore. If you want to try this technique for yourself, you’re welcome to use the source code as a starting point, and feel free to reach out!