Inside Sentinel: How Agentic AI Runs a Continuous Pentest

“Continuous pentest” gets tossed around as a marketing phrase so often that most security teams have stopped trusting it. Usually what vendors mean is “we run our scanner more often.” That’s not a pentest. A pentest is an adversary walking through your application the way a real attacker would — probing, chaining, pivoting — and producing a finding you can actually fix.

Sentinel is our answer to that gap. It’s an agentic AI PTaaS that runs four specialised agents in a loop, 24 hours a day, against scope you authorise. A human reviewer signs off before any finding ships. Below is how the architecture actually works, what the agents do, where the human boundary sits, and a walk-through of one realistic finding from start to Jira ticket.

Why continuous only works if you architect for it

Annual pentests made sense when releases went out quarterly. If you ship twice a day, a snapshot from six months ago tells you almost nothing about the attack surface that exists right now. A feature branch merges, a subdomain gets spun up for a partner demo, an S3 bucket gets re-permissioned to unblock a deploy — your report is stale.

The obvious fix is “run the pentest more often,” but you can’t keep senior testers hammering the same environment forever without drowning them in duplicate work. What you can do is split the job into pieces machines handle well — enumeration, noticing diffs, trying known exploit primitives — and pieces only humans should own, like judging whether a finding is real and how to write it up. Sentinel draws the line there.

The four-agent loop

Sentinel runs four agents. They’re not interchangeable LLM calls — each has its own toolchain, its own memory, and its own narrow job. The loop runs continuously for any scope you authorise.

Recon agent

The recon agent owns discovery. It pulls subdomains from cert transparency logs, resolves them with massdns, fingerprints services, cross-references against the last known state of your surface, and flags what’s new. If you add staging-api-v2.example.com at 2am, recon sees it before your on-call does. It uses subfinder and amass for breadth, then narrows with targeted probing. The output isn’t a flat list — it’s a graph of assets, relationships, and change timestamps.

The important detail is memory. The recon agent remembers what your surface looked like yesterday, last week, last quarter. A host that’s been stable for two years matters less than one that appeared forty minutes ago, and the agent weights its attention accordingly.
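That recency weighting is easy to sketch. Here's a minimal, illustrative version in Python — `diff_surface`, the seven-day half-life, and the decay curve are made up for this post, not Sentinel internals:

```python
from datetime import datetime, timedelta

def diff_surface(previous: dict[str, datetime], current: set[str],
                 now: datetime) -> list[tuple[str, float]]:
    """Rank hosts by attention weight: brand-new and recently changed
    assets first, long-stable ones last."""
    weighted = []
    for host in current:
        # A host absent from yesterday's snapshot is treated as having
        # appeared just now, so it gets maximum weight.
        first_seen = previous.get(host, now)
        age = now - first_seen
        # Exponential decay with a one-week half-life: a host that showed
        # up forty minutes ago scores ~1.0; one stable for two years, ~0.
        weight = 0.5 ** (age / timedelta(days=7))
        weighted.append((host, weight))
    return sorted(weighted, key=lambda hw: hw[1], reverse=True)
```

The point of the sketch is the shape of the decision, not the constants: new assets rise to the top of the probe queue automatically, without anyone writing a rule for them.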

Exploit agent

The exploit agent takes recon’s graph and tries to prove impact. It runs Nuclei templates where they fit, Semgrep for code it can reach, and a library of payload chains for classes of bugs that no single template catches — auth bypasses, SSRF to metadata, deserialisation chains, broken access control. It plans. It tries chain A, fails, reasons about why, tries chain B.

This agent is deliberately scoped. It won’t run exploits that could cause damage or data loss, and it stops at the first proof-of-concept step. If it finds an injection that could dump a database, it stops at a harmless canary row, not the full table. That scope is enforced in the agent’s tool definitions, not just its prompt.

Validate agent

The validate agent’s entire job is “did that actually work, and is it reproducible?” It takes the exploit agent’s claimed finding and re-runs the proof from a clean state, different IP, different session. False positives are where most AI security tools die. If the bug only reproduces once out of ten tries, it’s probably a race condition or a flake, and the validate agent downgrades it until we can confirm.

Validation also writes the reproduction steps. Not “we sent a POST request” — the exact curl command, the exact response, the exact line in the code (if we have source access via GitHub integration) that’s responsible.

Report agent

The report agent converts the validated finding into something a developer can act on. Severity assessment, business-impact framing, remediation guidance tied to the specific framework the client uses, and a CERT-In-aligned summary block for Indian clients who need one for disclosure.

The report agent does not ship the report. That’s the boundary.

A walk-through: from new subdomain to filed ticket

Here’s a realistic synthetic example. Names are made up; the shape of the finding is one we see often.

Monday, 3:14am. Recon notices test-admin.partner.example.com has resolved for the first time. Cert transparency picked up a new certificate issued forty minutes earlier. Recon fingerprints the host — it’s running a familiar admin panel behind basic auth.

Monday, 3:19am. Exploit agent gets the handoff. It tries default credential pairs for that admin panel — the panel’s documentation ships with admin/admin as the out-of-box default and a surprising number of deployments never change it. The pair works. The agent now has an authenticated session.

Monday, 3:22am. Inside the panel, exploit finds a file-read feature meant for log inspection. It reads /etc/passwd as proof of impact, stops there, and hands off.
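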

Monday, 3:25am. Validate agent re-runs the chain from a new egress IP. Same result. It captures the full reproduction — DNS resolution, basic auth handshake, session cookie, the file-read request. It tags the finding as a chain: exposed admin panel + default credentials + authenticated file read. On its own, default creds on a test box might be a medium. Combined, it’s a high.

Monday, 6:30am. A human reviewer on our side picks up the finding from the queue. They check the reproduction works, sanity-check that the host is actually in scope (sometimes test subdomains point at partner infrastructure we’re not authorised to touch), confirm severity, and approve.

Monday, 6:47am. The report lands in the client’s Jira with reproduction steps, fix guidance (rotate creds, put the admin panel behind VPN, disable the file-read feature), and a Slack notification to the security channel. Under four hours from subdomain appearing in cert transparency to ticket with remediation.

Without the loop, that subdomain would have sat there until the next quarterly review.

Where the human-in-the-loop line sits

We’re opinionated about what AI should and shouldn’t decide on its own.

AI decides: what to enumerate, what to probe, what exploit chains to try, how to write a reproduction, what severity to suggest, how to draft remediation text.

Humans decide: whether a finding is real, whether it’s in scope, whether the severity is right, whether anything ships. A reviewer also has the authority to send a finding back to the agents with notes — “this doesn’t reproduce on our side, retry with these constraints” — and the agents treat that feedback as signal for next time.

The reason is simple. Agentic systems hallucinate, and they misjudge severity in ways that are hard to catch if you’re not looking. A confident-sounding writeup of a non-issue wastes a developer’s afternoon and burns credibility. Human validation is the backstop that lets us claim zero false positives on what ships — not on what the agents generate, but on what lands in your queue.

Integrations that match how AppSec teams actually work

Findings don’t help if they sit in a portal nobody checks. Sentinel ships to where your team already is. Jira for ticketing, with custom field mapping so severity, CWE, and component route correctly. GitHub for issues tied to pull requests, with the affected file and line range where we can source-map it. Slack for real-time alerts on criticals, muted channels for lower severities.

For Indian regulated clients, the report agent produces a CERT-In-aligned summary block that can drop into a 6-hour disclosure notice if a finding crosses the reporting threshold. Not a feature you think you need until you’re scrambling to meet the window.

Why zero-false-positive reporting is the point

An AppSec team’s ability to trust its tooling is a finite resource. Every time they chase a finding that turns out to be nothing, that trust ticks down. After enough false positives, they stop reading reports carefully, and the real findings get missed alongside the noise.

Sentinel runs hot on the agent side — we’d rather the exploit agent try a hundred chains and produce twenty candidate findings than play it safe and miss one. But nothing ships to you without a human confirming it. The reviewer is the filter that turns that aggressive search into signal instead of noise.

That’s the design principle. Let the machines look for bugs continuously. Let humans decide what’s worth your team’s Monday morning.