Deduping Bug Bounty at Scale: How Hive Clusters Submissions

Ask anyone who’s run a bug bounty program for longer than six months what the single biggest operational headache is, and the answer will almost always be the same word: duplicates. Researchers complain that their valid bugs get closed as dupes of reports they’ve never seen. Triagers complain that they spent two hours reproducing a finding only to realise it was the same issue reported from a slightly different angle last Tuesday. Program owners look at the median-time-to-triage number and wince.

Hive, our bug-bounty platform, treats deduplication as the core engineering problem. Everything else — payouts, policy templates, regional onboarding — sits on top of a dedup pipeline that has to get the judgement calls right most of the time and hand off cleanly to a human when it can’t.

Why dedup is the hard part

At small scale, a human triager reads every submission and remembers what they’ve seen. That works for a program getting three reports a week. It stops working around thirty a week, and it falls apart completely past a few hundred.

The naive solution is string matching. Two reports mention the same URL and the same parameter, so they’re duplicates. This fails constantly in both directions. Researchers describe the same bug with different words, test on different endpoints that share the same backend, or submit the same XSS with a different payload. And unrelated bugs on the same URL — an XSS and an IDOR, say — get falsely merged.

The next step up is signature matching: hash the affected endpoint plus CWE plus a normalised PoC. Better, still brittle. Bugs that exist because of a shared library affect multiple endpoints; a signature-per-endpoint treats them as separate bugs when they’re one underlying issue with one fix.
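A minimal sketch of that signature scheme makes the brittleness concrete. Everything here (the helper names, the normalisation, the example endpoints) is illustrative, not Hive's actual implementation:

```python
import hashlib

def normalise_poc(poc: str) -> str:
    # Crude normalisation: lowercase and collapse whitespace.
    return " ".join(poc.lower().split())

def signature(endpoint: str, cwe: str, poc: str) -> str:
    # Signature = hash of endpoint + CWE + normalised PoC.
    material = f"{endpoint}|{cwe}|{normalise_poc(poc)}"
    return hashlib.sha256(material.encode()).hexdigest()

# The same library-level SSRF on two endpoints produces two signatures,
# so signature matching counts one underlying bug (one fix) as two bugs.
a = signature("/api/v1/export", "CWE-918", "curl -d url=http://169.254.169.254")
b = signature("/api/v2/export", "CWE-918", "curl -d url=http://169.254.169.254")
assert a != b
```

The scheme is deterministic and cheap, which is why it's a common first attempt; it just encodes the wrong notion of identity.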

What you actually want is semantic matching — does this report describe the same underlying defect as that one, even if the surface details differ.

How semantic clustering works in Hive

When a submission lands, Hive extracts three signals: the free-text description, the PoC code or reproduction steps, and the affected URLs and parameters. Each signal gets embedded into a vector space using a model tuned on security-report text, which matters — general-purpose embeddings don’t know that “SSRF to metadata endpoint” and “AWS IMDS abuse” are the same thing.

Those three embeddings get combined with weights we’ve tuned from real triage data. Then we cluster by cosine similarity against every open and recently-closed report in the program’s history.

If a new submission sits above the cluster threshold for an existing finding, it gets attached to that cluster automatically and flagged as a probable duplicate. If it falls below the threshold for every cluster, it opens a new cluster.
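The attach-or-open decision can be sketched as follows, assuming the three signal embeddings are already computed. The weights and the 0.82 threshold are placeholders for illustration, not Hive's tuned values:

```python
import numpy as np

THRESHOLD = 0.82  # hypothetical per-program cluster threshold

def combine(desc_vec, poc_vec, url_vec, weights=(0.5, 0.3, 0.2)):
    # Weighted sum of the description, PoC, and URL embeddings,
    # re-normalised to unit length so cosine comparisons stay meaningful.
    v = weights[0] * desc_vec + weights[1] * poc_vec + weights[2] * url_vec
    return v / np.linalg.norm(v)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign(submission_vec, cluster_centroids):
    # Attach to the most similar cluster above threshold,
    # otherwise open a new cluster.
    best_id, best_sim = None, -1.0
    for cid, centroid in cluster_centroids.items():
        sim = cosine(submission_vec, centroid)
        if sim > best_sim:
            best_id, best_sim = cid, sim
    if best_sim >= THRESHOLD:
        return ("duplicate", best_id, best_sim)
    return ("new_cluster", None, best_sim)
```

The decision, the matched cluster, and the similarity score are all kept, so a triager reviewing a probable-duplicate flag can see how close the call was.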

In our current calibration we hit around 90% dedup accuracy on submissions that have clear ground truth — meaning that nine times in ten, the system’s dupe-or-unique call matches what a senior triager would decide. The median time from submission landing to a dedup decision is under five minutes.

The threshold tradeoff

There’s no perfect similarity threshold. It’s a tradeoff you have to make honestly.

Tight threshold — high precision on dedup, but you miss real duplicates. Triagers end up reviewing the same underlying bug three times because the reports phrased it differently. Good for researcher trust, bad for ops.

Loose threshold — you catch all the real dupes but you also merge reports that aren’t really duplicates. A researcher who submitted something genuinely new sees their report attached to someone else’s cluster and marked “duplicate,” and their trust in the program evaporates. That’s how bounty programs get a bad reputation on researcher forums.

Our default calibration leans slightly toward tight — we’d rather do a small amount of extra triage than falsely merge — and every auto-dedup decision is reversible by a human arbitrator. If a researcher disputes a dupe decision, the case goes to arbitration, and the arbitrator’s outcome feeds back into the threshold tuning for that program’s cluster history. Over a few months of operation, the thresholds for each program converge on something calibrated to that program’s specific attack surface.
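One way that feedback loop could look, sketched with hypothetical outcome labels, step size, and bounds; the real tuning presumably weighs more signal than a fixed step:

```python
def tune_threshold(threshold, outcome, step=0.005, lo=0.70, hi=0.95):
    # Arbitration feedback: a false merge means the threshold was too
    # loose (raise it); a missed duplicate means it was too tight
    # (lower it). Clamp to sane bounds either way.
    if outcome == "false_merge":
        threshold += step
    elif outcome == "missed_dupe":
        threshold -= step
    return min(hi, max(lo, threshold))
```

Because each program has its own cluster history and its own stream of arbitration outcomes, the thresholds drift apart per program rather than converging on one global value.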

Severity scoring that respects chains

The other place bounty platforms reliably get it wrong is severity. An SSRF that can only hit a handful of internal IPs might score as a medium on its own. An SSRF that reaches the cloud metadata service and returns temporary credentials is a critical — same bug class, radically different impact.

Hive uses exploit-chain reasoning for severity. When a submission lands, we don't just score the bug in isolation. We look at what the reported bug connects to — what reachability it grants, what other findings in the program's history it could chain with, what assets sit behind it. A "low" SSRF, plus a path to the metadata endpoint that the researcher didn't mention but which exists on the same host, combines into a critical chain score.

Researchers can of course submit chains explicitly, and they’re rewarded for it. The platform’s job is to catch the chain even when the submission doesn’t make it obvious, so the payout reflects actual impact.
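A toy version of that chain-aware scoring is below. The severity ranks, the shape of a finding, and the "high-value asset" set are assumptions for illustration, not Hive's actual model:

```python
SEVERITY_RANK = {"low": 1, "medium": 2, "high": 3, "critical": 4}
RANK_SEVERITY = {v: k for k, v in SEVERITY_RANK.items()}

def chain_score(findings, high_value_assets):
    # findings: list of dicts with "severity" and "reaches"
    # (the set of assets the finding grants reachability to).
    # Start from the worst standalone severity in the chain.
    score = max(SEVERITY_RANK[f["severity"]] for f in findings)
    # Union of everything the chained findings can reach together.
    reachable = set().union(*(f["reaches"] for f in findings))
    # If the chain touches a high-value asset (e.g. cloud metadata
    # credentials), escalate to critical regardless of the parts.
    if reachable & high_value_assets:
        score = SEVERITY_RANK["critical"]
    return RANK_SEVERITY[score]
```

The interesting part in practice is building the `reaches` sets, which is where the program's asset inventory and prior findings come in; the scoring rule itself is simple once reachability is known.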

Researcher fairness when a cluster has multiple submitters

Once you’re clustering, you have to decide who gets credit. The rule we’ve landed on is first-submitter wins the cluster. The second and subsequent submitters get an acknowledgement — their name on the cluster — but the payout goes to whoever was first in by timestamp.

This sounds obvious but matters enormously for program health. Researchers will not participate in a program where they can’t tell whether their work will be credited. A clear, unambiguous first-in rule means they know the system. And because dedup is semantic, “first” actually means first to describe the underlying bug, not first to find a particular URL — a researcher who writes a cleaner report two days later doesn’t get to bump an earlier, messier report that identified the same defect.

Edge cases — partial overlap, chains that extend someone else’s finding — go to human arbitration. The arbitrator decides how to split credit (full award to each, partial split, bonus for the chain extender) and the outcome is written down so future similar cases get consistent treatment.
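The first-in rule itself is mechanically simple; a sketch with hypothetical field names, covering only the automatic path (arbitration handles the splits):

```python
from datetime import datetime

def assign_credit(cluster_submissions):
    # First submitter by timestamp wins the payout for the cluster;
    # everyone else is acknowledged by name on the cluster.
    ordered = sorted(cluster_submissions, key=lambda s: s["submitted_at"])
    winner, rest = ordered[0], ordered[1:]
    return {
        "payout": winner["researcher"],
        "acknowledged": [s["researcher"] for s in rest],
    }
```

Because the clustering is semantic, the timestamp being compared is "first to describe the defect," which is exactly the property the prose above argues for.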

Payouts that work for Indian researchers

A bounty program that takes two weeks to pay is a bounty program losing its researchers. Hive runs UPI payouts within 48 hours for Indian researchers once a finding is validated and approved. International researchers are paid over ACH and SWIFT, with settlement windows that are slower because of banking reality, not because we drag our feet.

The onboarding flow supports regional languages — Hindi, Marathi, Tamil, and Telugu at launch — because not everyone in the next generation of Indian researchers wants to fill out tax and KYC forms in English, and the programs that figure this out first will attract the deepest research pool.

Triage SLAs and what “at scale” really means

A bounty program running at scale is not a pipeline of triage tickets. It's a production system with SLAs. Hive exposes median triage time, p90 triage time, open cluster count, researcher response time, and payout latency as first-class metrics per program. A program owner can see at a glance whether their program is healthy or whether triage is falling behind.
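Two of those metrics are straightforward to compute from raw triage times; a sketch using a nearest-rank p90 (an assumption — a real implementation might interpolate):

```python
import statistics

def triage_metrics(triage_minutes):
    # Per-program health snapshot from a list of per-report triage
    # durations (in minutes): median and nearest-rank p90.
    ordered = sorted(triage_minutes)
    p90_index = max(0, round(0.9 * (len(ordered) - 1)))
    return {
        "median_min": statistics.median(ordered),
        "p90_min": ordered[p90_index],
    }
```

The p90 matters more than the median for spotting a program falling behind: a healthy median with a blown-out p90 usually means a backlog of hard clusters nobody is picking up.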

Policy templates for BFSI, government, and SaaS workloads pre-fill scope, severity rubrics, and exclusion language so a new program doesn’t start with a blank policy page. Those templates were drafted with input from programs operating under RBI and SEBI guidance and reflect what regulators expect from a bounty running inside a regulated institution.

What scale actually requires

“Run a bug bounty at scale” usually gets pitched as a sourcing problem — how do you attract more researchers. The sourcing part is real but overrated. The hard part is operational: dedup you can trust, severity that reflects actual impact, clear fairness rules, payouts that settle fast, and SLAs you can measure.

Get those right and the program runs itself. Get them wrong and no amount of researcher recruitment will save you, because the good researchers will stop showing up.

Hive is the bet that dedup is the load-bearing piece, that semantic clustering plus human arbitration gets it right often enough to be worth running, and that a platform designed around this principle will save AppSec teams the one thing they never have enough of — attention.