AI coding agents seem remarkably bad at figuring out when they’re done; they’ll gladly claim success when in actuality the code doesn’t build, the app doesn’t boot, and there are formatting issues everywhere. Perhaps it’s because of the sycophancy we reinforce during model training — they really want to tell us what we want to hear, as soon as possible. Or perhaps it’s because the models are simulating human engineers, who are also notorious for saying they’re done when they’re not…
Regardless, to be useful, we need our coding agents to keep working until they’ve in fact completed the task at hand. And, since we can’t just take their word (token stream) for it, we need something else to hold the agent accountable — ideally, something that will automatically pass or fail, depending on whether expectations are met. I think that something is, of course, tests.
Not just any tests, though. You probably (hopefully?) already have tests in your codebase that exercise the application code. Those are relevant here as well, but for agentic work, the system to test is the codebase itself: If a file was supposed to be renamed, we can test that the new path exists; if a new script was supposed to be implemented, we can ensure that script runs; and so on.
The coding agent can write the tests before it begins, and show that they fail — yes, that’s test-driven development for agentic tasks. Because these tests are specific to the task, and because the task is ephemeral (once it’s done, we forget about it), the tests too are ephemeral; we can exclude them from source control and leave them behind. If we want to set global expectations across tasks, we can commit those as a durable suite.
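In practice, the exclusion can be a single .gitignore rule. One possible arrangement, assuming the per-task scripts follow the task.verify.ts naming used later in this post (the committed suite’s filename here is hypothetical):

# ephemeral, per-task verification scripts stay out of source control
*.verify.ts
# ...except a durable suite we deliberately commit
!repo.verify.ts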
These tests could be run in standard runners like Jest. That’s where I started when exploring this idea. For example:
describe("README.md", () => {
it("should exist", async () => {
const readmePath = path.join(import.meta.dir, "README.md");
const stats = await stat(readmePath);
expect(stats.isFile()).toBe(true);
});
});
This is reasonable, but tokens are expensive, so I wrapped the test API into something briefer:
verify.file("README.md").exists();
Under the hood, it was still registering tests, becoming a sort of test authoring meta-API. This worked ok, but the test framework primitives made authoring new verification APIs a bit strange — particularly the need to invent new types to submit to matchers in order to get detailed errors. So I ripped out the test framework dependency and swapped in a purpose-built runner. Enter verify-repo.
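To make “meta-API” concrete, that earlier phase could be sketched roughly like this (my own illustration of the shape, not the actual code): a wrapper whose methods don’t assert anything themselves, but register describe/it blocks with the underlying test runner.

// describe/it/expect come from the test runner (Jest globals, or imported from bun:test)
import { stat } from "node:fs/promises";
import path from "node:path";

export const verify = {
  file(relativePath: string) {
    return {
      exists() {
        describe(relativePath, () => {
          it("should exist", async () => {
            const stats = await stat(path.join(process.cwd(), relativePath));
            expect(stats.isFile()).toBe(true);
          });
        });
      },
    };
  },
};

Calling verify.file("README.md").exists() then schedules a test rather than running one, and the friction shows up as soon as you want failure messages more specific than the stock matchers provide.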
You can check out the code on GitHub. It supports checks for directories, commands, formatting, source control, and more. Just like with the test API, calling verify doesn’t run the verification; it only registers it. This lets us write concise, synchronous test code up front, then orchestrate the actual verification inside the engine.
verify.dir("temp").not.exists();
verify.script("dev").outputs("http://localhost:5173");
verify.git.isClean();
verify.prettier.isFormatted();
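The register-then-run split doesn’t need a test framework at all. Here’s a minimal sketch of the pattern (a simplified illustration, not verify-repo’s internals): registration calls synchronously push checks onto a list, and a runner awaits them afterwards, reporting each result and exiting non-zero if anything failed.

import { stat } from "node:fs/promises";

type Check = { label: string; run: () => Promise<void> };
const checks: Check[] = [];

// registration API: synchronous, only records what to verify later
const verify = {
  file(p: string) {
    return {
      exists() {
        checks.push({
          label: `file "${p}" exists`,
          run: async () => {
            const stats = await stat(p); // throws if the path is missing
            if (!stats.isFile()) throw new Error(`"${p}" is not a regular file`);
          },
        });
      },
    };
  },
};

// engine: runs every registered check and reports pass/fail
async function run() {
  let failures = 0;
  for (const check of checks) {
    try {
      await check.run();
      console.log(`pass  ${check.label}`);
    } catch (err) {
      failures += 1;
      console.error(`FAIL  ${check.label}: ${(err as Error).message}`);
    }
  }
  process.exit(failures > 0 ? 1 : 0); // a non-zero exit code is what tells the agent it isn't done
}

verify.file("README.md").exists();
run();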
The basic workflow is to ask your coding agent to author and run a task.verify.ts script before starting its work. Depending on your agent, you might be able to accomplish this via project-level instructions; otherwise, you can request it directly in the prompt. The library comes with a command that outputs all available verifications, so the agent knows what it has to work with. After the initial run, the agent sees the issues it needs to address and can keep working until all the errors have disappeared — and you can use the verification suite as part of reviewing the code changes when they’re ready.
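For instance, a task.verify.ts might combine the checks shown above. The check calls below are the ones from this post; the import specifier is an assumption on my part, and the exact way the registered checks get executed is whatever the package’s README prescribes.

// hypothetical task.verify.ts; the import path is a guess, consult the verify-repo docs
import { verify } from "verify-repo";

verify.file("README.md").exists();
verify.dir("temp").not.exists();
verify.script("dev").outputs("http://localhost:5173");
verify.git.isClean();
verify.prettier.isFormatted();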
We shouldn’t expect coding agents to be perfect — they’re emulating us, after all. However, we can set them up for success with the right context, tools, and guardrails. If you’re tired of your coding agents calling things good when they’re most definitely not, try adding some repo tests (and let me know how it goes!).
A note on building this prototype: Software development in 2025 is wild. The value of generative AI technologies for information workers in general may remain unclear, but the value for devs is unambiguous. I had the initial idea for this project on Friday morning, and I designed most of the core API in a voice conversation with ChatGPT while walking to an appointment. Before heading home, from my phone, I exported a design summary from ChatGPT, created a GitHub repo, and handed the design to a Cursor cloud agent to get things started. Over the next couple of days, I iterated asynchronously, frequently leaving Cursor to work independently, hopping back and forth between this project and making a birthday gift for my brother. Along the way, I learned that things were possible in JavaScript that I’d never done before — in particular, keeping the API concise, extensible, and safe ended up involving substantial use of proxies. The library is very much a prototype and not deeply tested, but I figured it was far enough along to toss out there and see if it resonated with others. So now, less than three days later, there’s a rocky alpha on npm. I suspect this will all feel normal at some point, but for now… like I said, wild. (I did write all the prose in this post by hand, though — I still believe some things are best kept human.)
