DIY AI Product Security: Buy vs Build - A Task Is Not a System

A Task Is Not a System - an agent writes one PR; a program runs the whole loop, on every repo

Prompting an agent to fix a vulnerability is a task. Running an AppSec program across hundreds of repos (or more) is a system. The gap between them is the entire product.

The best question I get on a demo call now is a challenge, and a fair one. A security leader shares their screen, points an AI coding agent at one of their repos, and a minute later there's a clean remediation PR sitting in GitHub. It builds, it reads well, it closes a real finding. Then they look back at me and ask the obvious thing: "We already do this ourselves. Our scanners flag the issue, we prompt our agent, it opens the PR, we merge it. Why would we need you?"

It's the smartest objection in the category right now, and the demo they just ran is completely real. So I leave the demo alone. The word worth questioning is "this."

Here's what that demo actually proves. The moment you reach for an AI agent to triage findings and write fixes, you've already settled the biggest question in AppSec: AI agents are how product security gets done now. We agree completely. Nullify runs on the same frontier models you're prompting. The model was never the disagreement.

So the question was never "AI or not." It's buy vs. build. Do you want to build and operate an AI security system yourself, or run one that's been engineered for exactly that, every day, for three years?

The data on building these systems in-house is sobering. MIT's 2025 State of AI in Business study found that 95% of enterprise GenAI pilots deliver no measurable impact on the bottom line, and that tools bought from external vendors succeed roughly twice as often as the ones built internally (MIT / Fortune, 2025). Wiring it up yourself is, statistically, the path most likely to stall.

A Copilot Is Not a Workforce

What you built is a fantastic task-doer. Point it at a repo, hand it a finding, get a PR. That task works. But a task is not a system, and a copilot is not a workforce.

The distance between "the agent can write a remediation PR" and "an AI security organization that runs unattended across a thousand repos, deciding what's exploitable, validating the fix, ticketing it, chasing it to merge, all at a forecastable cost" is the entire product. It's the difference between a power tool and a crew that shows up every morning.

Let me walk the exact pipeline a prospect described, because the gaps don't live in the steps. They live in the spaces between them.

Walking the DIY Pipeline

The workflow: scanners surface findings, you prompt the agent, it opens a PR, a human reviews and merges. Here's what it misses, step by step.

Before the PR: what to fix. Your scanners hand over raw findings, and the agent writes a PR for whatever it's pointed at: the unreachable finding, the build-time-only one, the CVE that's critical on paper but unexploitable in your cloud. Nothing in the pipeline asks "does this actually matter here?" So you pay tokens to generate the fix, then your engineers' afternoons to review the noise. In a recent enterprise POV, a library with four critical and ten high CVEs triaged out as negligible because it never ran in production. A DIY pipeline writes four PRs for it anyway.
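To make that missing question concrete, here is a minimal sketch of a triage gate in Python. The `Finding` fields (`reachable`, `runs_in_production`) and the placeholder CVE identifiers are assumptions made up for illustration; real reachability and deployment signals have to come from somewhere, and producing them is the hard part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    cve_id: str
    severity: str             # as reported by the scanner: "critical", "high", ...
    reachable: bool           # hypothetical signal: is the vulnerable code path ever called?
    runs_in_production: bool  # hypothetical signal: does this component ship to production?


def worth_fixing(finding: Finding) -> bool:
    """A finding justifies a PR only if it is both reachable and deployed."""
    return finding.reachable and finding.runs_in_production


def triage(findings):
    """Split findings into those worth a PR and those to deprioritize."""
    to_fix = [f for f in findings if worth_fixing(f)]
    deprioritized = [f for f in findings if not worth_fixing(f)]
    return to_fix, deprioritized


# The POV example: four critical and ten high CVEs in a library
# that never runs in production. The IDs are placeholders.
library_findings = (
    [Finding(f"CVE-EXAMPLE-C{i}", "critical", True, False) for i in range(4)]
    + [Finding(f"CVE-EXAMPLE-H{i}", "high", True, False) for i in range(10)]
)

to_fix, deprioritized = triage(library_findings)
print(len(to_fix), len(deprioritized))
```

With a gate like this in front of the agent, none of the fourteen findings would generate a PR. Without it, severity on paper is the only signal the pipeline has.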

The PR itself: is the fix real? Does it build? Pass tests? Quietly break an API three services depend on? Validation that should happen before a PR is ever surfaced lands on your developer instead, after the fact, on every single PR.
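One way to picture validation before surfacing is a gate that runs every check and only lets the PR through if all of them pass. The sketch below is an assumption-laden illustration: the check names are hypothetical, and the lambdas stand in for real build, test, and API-compatibility jobs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    name: str
    passed: bool


def validate_fix(checks: dict[str, Callable[[], bool]]) -> list[CheckResult]:
    """Run every check; a check that raises counts as a failure."""
    results = []
    for name, check in checks.items():
        try:
            passed = bool(check())
        except Exception:
            passed = False
        results.append(CheckResult(name, passed))
    return results


def should_surface(results: list[CheckResult]) -> bool:
    """Only surface the PR to a developer if every check passed."""
    return all(r.passed for r in results)


# Hypothetical checks: the fix builds and passes tests,
# but changes an API that other services depend on.
checks = {
    "builds": lambda: True,
    "tests_pass": lambda: True,
    "public_api_unchanged": lambda: False,
}

results = validate_fix(checks)
failed = [r.name for r in results if not r.passed]
print(should_surface(results), failed)
```

Here the PR is held back and the failed check is named, so the agent (or a person) can retry before a reviewer ever sees it.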

Across findings: what first? The same vulnerable dependency in forty repos is forty prompts and forty PRs, with no sense of which actually sit on an internet-facing service. No correlation, no dedup, no prioritization by risk. The pipeline fixes things in whatever order you prompt them.
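As a rough sketch of what dedup and risk ordering mean here, the Python below groups findings by dependency and advisory, then puts internet-facing repos first. The `RepoFinding` fields, the repo names, and the `internet_facing` flag are all hypothetical; in practice that exposure signal has to come from cloud context the DIY pipeline does not have.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoFinding:
    repo: str
    package: str
    advisory: str
    internet_facing: bool  # hypothetical exposure flag


def group_and_rank(findings):
    """Collapse findings into one group per (package, advisory),
    ordered so internet-facing repos come first."""
    groups = defaultdict(list)
    for f in findings:
        groups[(f.package, f.advisory)].append(f)
    return {
        # False sorts before True, so negate the flag to put exposed repos first.
        key: sorted(items, key=lambda f: (not f.internet_facing, f.repo))
        for key, items in groups.items()
    }


# Forty repos share one vulnerable dependency; three are internet-facing.
exposed = {3, 17, 29}
findings = [
    RepoFinding(f"repo-{i:02d}", "examplelib", "ADVISORY-EXAMPLE", i in exposed)
    for i in range(40)
]

ranked = group_and_rank(findings)
group = ranked[("examplelib", "ADVISORY-EXAMPLE")]
print(len(ranked), [f.repo for f in group[:3]])
```

Forty raw findings become one issue with an ordered rollout, instead of forty prompts in arbitrary order.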

After the PR: the last mile. The agent opens the PR and stops. The ticket, the Slack nudge, the follow-up next week, the chase to merge: all hand-work. That last mile is most of the job of running AppSec, and the pipeline doesn't touch it.
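Even the smallest piece of that last mile, deciding which open PRs need a follow-up, is logic someone has to write and run on a schedule. A minimal sketch, assuming a hypothetical `OpenPR` record and a seven-day idle threshold chosen only for illustration:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class OpenPR:
    ref: str                  # placeholder identifier, not a real PR
    owner: str
    last_activity: datetime


def prs_needing_nudge(prs, now, max_idle=timedelta(days=7)):
    """Return open PRs with no activity for at least max_idle."""
    return [pr for pr in prs if now - pr.last_activity >= max_idle]


now = datetime(2025, 6, 15, 9, 0)
open_prs = [
    OpenPR("fix-a", "team-payments", datetime(2025, 6, 14, 9, 0)),  # 1 day idle
    OpenPR("fix-b", "team-search", datetime(2025, 6, 5, 9, 0)),     # 10 days idle
    OpenPR("fix-c", "team-auth", datetime(2025, 6, 8, 9, 0)),       # exactly 7 days idle
]

stale = prs_needing_nudge(open_prs, now)
print([pr.ref for pr in stale])
```

The filter is the easy part. Fetching PR state, routing the nudge to the right owner, and keeping the ticket in sync are each separate integrations.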

The ceiling: what never even becomes a finding. You can only remediate what your scanners detect. App-contextual secrets, missing-authorization logic flaws, language-toolchain vulns: if your scanners miss them, the agent never sees them, and they sit silently in production. The agent's ceiling is bolted to the floor of whatever you already had.

It's Never Just the Agent

Here's the part the clean demo leaves out. Almost nobody points a raw agent at a repo with no other context. You've wired it into the tools you already run: your SAST scanner, your dependency and software-composition scanning, your cloud-security and runtime tooling, your secrets scanning, maybe an ASPM aggregator on top. The agent writes fixes from that combined context.

That's a smart instinct, and the upside is real: more context is better context. An agent that sees a code finding, a reachability graph, and cloud exposure at once writes a better fix than one staring at a single scanner.

Now the cost, which is where it stops being free.

Every tool you stitch in is another integration you own, each with its own API, data model, and severity scale, all drifting on the vendor's schedule, not yours. You're no longer maintaining one harness; you're maintaining one per tool, plus the glue between them.

The same vulnerability shows up three times, in three tools, with three IDs and three severities, and reconciling that into one truth is its own hard product. An entire category of tooling exists mostly to do that correlation. Until you build it, your agent is handed conflicting signals and guesses which to believe. Feed it a scanner running at a 60% false-positive rate and you get a confident agent writing fixes for noise.
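To show the shape of that reconciliation problem, here is a deliberately simplified Python sketch. The tool names, severity vocabularies, record fields, and the fingerprint are all assumptions for illustration; real tools rarely agree on paths or weakness IDs this neatly, which is why correlation is its own product.

```python
from collections import defaultdict

# Hypothetical per-tool severity vocabularies, mapped onto a shared 0-4 scale.
SEVERITY_MAPS = {
    "tool_a": {"info": 0, "low": 1, "medium": 2, "high": 3, "critical": 4},
    "tool_b": {"P4": 1, "P3": 2, "P2": 3, "P1": 4},
    "tool_c": {"note": 0, "warning": 2, "error": 3},
}


def fingerprint(record):
    """Assumed identity of a vulnerability: same repo, file, and weakness class."""
    return (record["repo"], record["path"], record["weakness"])


def correlate(records):
    """Merge per-tool records into one entry per fingerprint."""
    merged = defaultdict(lambda: {"tool_ids": [], "severities": []})
    for r in records:
        entry = merged[fingerprint(r)]
        entry["tool_ids"].append(f'{r["tool"]}:{r["id"]}')
        entry["severities"].append(SEVERITY_MAPS[r["tool"]][r["severity"]])
    return {
        key: {
            "tool_ids": v["tool_ids"],
            # Design choice for this sketch: trust the most severe rating.
            "severity": max(v["severities"]),
            "disagreement": len(set(v["severities"])) > 1,
        }
        for key, v in merged.items()
    }


# One vulnerability, three tools, three IDs, three severities.
records = [
    {"tool": "tool_a", "id": "A-101", "severity": "high",
     "repo": "checkout", "path": "src/db.py", "weakness": "sql-injection"},
    {"tool": "tool_b", "id": "B-7", "severity": "P1",
     "repo": "checkout", "path": "src/db.py", "weakness": "sql-injection"},
    {"tool": "tool_c", "id": "C-55", "severity": "warning",
     "repo": "checkout", "path": "src/db.py", "weakness": "sql-injection"},
]

unified = correlate(records)
issue = unified[("checkout", "src/db.py", "sql-injection")]
print(len(unified), issue["severity"], issue["disagreement"])
```

Taking the maximum severity is only one possible policy, and it inherits every false positive from the noisiest tool. Choosing and maintaining that policy is part of what you take on when you build.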

And the bill underneath the demo was never the token bill. It's your scanner licenses, the compute to host them, the engineers who keep the integrations alive, plus the tokens on top. That last line is the volatile one, and it moves fast: Uber reportedly burned through its entire 2026 AI-coding budget in four months (Forbes, 2026), and one startup watched its annual AI bill jump from $400K to $1.4M overnight after it crossed a vendor pricing threshold (Pylon, via LinkedIn). The agent is the cheap, visible layer on a five-to-seven-figure stack you stopped noticing years ago.

It's the same buy-vs-build choice, multiplied by every tool in your stack. A bought system collapses it: Nullify runs detection (SAST, SCA, secrets, pentesting), reachability, cloud context, and the asset graph as one platform on one data model, so there's nothing to reconcile and one source of truth instead of six consoles disagreeing.

Where Do-It-Yourself Genuinely Works

Let me be straight, because credibility is the point. For a single repo, a skilled engineer, and an afternoon, a DIY agent writes an excellent fix, and that demo is genuinely impressive, which is exactly why it's seductive. For a one-off or a small surface, the build path is perfectly reasonable.

The agent can write the fix; we use the same models it does. The harder undertaking is operating the whole system, across the estate, every day, reliably, at a cost you can put in a spreadsheet. The demo is the easy 20%. The system is the 80% that doesn't fit on a screen.

The Part You Can't See Is the Part You're Buying

That 80% has a name. It's the harness: the routing, validation, self-healing, dedup, prioritization, and connective tissue around the model, kept alive as the models and the attacks change every month. It's also where most of the cost and most of the failure live. I will cover that in Part 2.

For now, one ask: if you're already running agents against your findings, make buy vs. build a deliberate decision, because right now the demo is making it for you. The POV is free and runs alongside what you've built. If a harness your team maintains still beats one we've spent three years on, across your whole estate at a forecastable cost, no harm done, and you'll have learned something useful.

Next, in Part 2, The Harness Is the Product: the routing, validation, and self-healing around the model is the hard, volatile 80%, and it's the part that quietly turns your security team into a security-tooling company.

Written by: Bryan Taylor
