Disposable email false positives: 380-domain allowlist


The first time someone runs a pattern-based disposable-email detector and it flags harvard.edu, they have a bad day. Then they discover it also flags gov.uk because the pattern table contains a country-TLD heuristic that fires on .uk subdomains. Then they find it flags gmail.com because a community blocklist was forked from a 2019 list that contained one bad entry. By the time they’re done, they’ve discovered that disposable detection without a guardrail layer is a recipe for blocking the exact users you most want to keep.

The fix is an explicit allowlist. In our production database, the legit_mail_domains table has 380 entries, organized into 8 categories, that override any disposable claim regardless of source. Lookups for gmail.com return legitimate even if some blocklist somewhere managed to add gmail.com. Lookups for harvard.edu return legitimate even if a pattern matched. Government TLDs (.gov, .gov.uk, .gc.ca, etc.) get the same safety net. The mechanism is simple: Tier 0 wins.

This post: what’s in the allowlist, why each category matters, and how the override mechanic protects against the false-positive failure mode that breaks naive disposable detectors.

What’s actually in the 380 entries

The legit_mail_domains table splits across 8 categories:

| Category | Examples | Coverage |
|---|---|---|
| webmail-public | gmail.com, yahoo.com, outlook.com, hotmail.com, aol.com, icloud.com, mail.com, zoho.com, ymail.com, live.com | Top consumer providers |
| isp | comcast.net, att.net, verizon.net, charter.net, telus.net, btinternet.com, sky.com, orange.fr, telekom.de | Top ISPs across major markets |
| corporate | apple.com, microsoft.com, salesforce.com, amazon.com, oracle.com, ibm.com, etc. | Top employer corporate mail domains |
| education | harvard.edu, mit.edu, stanford.edu, cam.ac.uk, ox.ac.uk, eth.ch, tsinghua.edu.cn | Major universities |
| government | whitehouse.gov, state.gov, gov.uk, parliament.uk, europa.eu, canada.ca, gov.au | Public sector |
| regional-webmail | mail.ru, qq.com, 163.com, sina.com, rambler.ru, naver.com, daum.net, gmx.de, web.de | Non-Western consumer webmail |
| privacy-mail-legit | protonmail.com, proton.me, tutanota.com, simplelogin.io, mozmail.com, duck.com | Legitimate privacy services (some overlap with alias-forwarder category) |
| hosting-default | googlemail.com, googlemail.de | Webmail aliases of major providers |

Two things to notice. First, the list is short, and it doesn’t need to be big. Legitimate mail traffic follows a steep Pareto distribution: probably 95% of legitimate email worldwide originates from accounts at the top 380 mail domains.

Second, the categories aren’t just informational. They drive the safety-net rules for entire TLDs.

The TLD safety nets

The allowlist has 380 explicit entries, but it also drives implicit overrides for entire top-level domains: edu, gov, mil, and int, along with government suffixes such as .gov.uk and .gc.ca.

These are the tld-untouchable rules in the detection_source field — they show up in the source breakdown for ~251 detection events per refresh cycle, mostly low-volume edu and gov hits that would otherwise trip a borderline pattern.
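A minimal sketch of how a TLD safety-net check of this kind could work, assuming a plain suffix match on label boundaries. The suffix list is illustrative, built from the examples named in this post rather than the production rule set, and matchUntouchableSuffix is a hypothetical helper; tldUntouchable simply mirrors the name used in the cascade pseudocode.

```javascript
// Illustrative suffix list; the production tld-untouchable rules may differ.
const UNTOUCHABLE_SUFFIXES = ['edu', 'gov', 'mil', 'int', 'gov.uk', 'gc.ca'];

// Returns the matching suffix (e.g. 'edu') or null if no safety net applies.
function matchUntouchableSuffix(domain) {
  // Normalize case and drop a trailing root dot ('harvard.edu.').
  const d = domain.trim().toLowerCase().replace(/\.$/, '');
  for (const suffix of UNTOUCHABLE_SUFFIXES) {
    // Match only on a label boundary so 'fakegov.uk' never qualifies.
    if (d === suffix || d.endsWith('.' + suffix)) return suffix;
  }
  return null;
}

function tldUntouchable(domain) {
  return matchUntouchableSuffix(domain) !== null;
}
```

Matching on the dot boundary is the important design choice here: a bare endsWith on the suffix would let a registrable domain like fakegov.uk inherit the government safety net.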

Why the allowlist wins over the blocklist

The mechanic is the order of operations. Tier 0 is the allowlist check; it runs first and overrides everything. The pseudocode for the detection cascade:

function classify(domain) {
  // Tier 0: allowlist override
  if (allowlist.has(domain) || tldUntouchable(domain)) {
    return { verdict: 'legitimate', confidence: 100, tier: 0 };
  }

  // Tier 1: disposable_mail_domains direct hit
  if (disposable.has(domain)) {
    return disposable.get(domain);
  }

  // Tier 2-3: MX/IP cluster detection
  return resolveAndClassify(domain);
}

The allowlist hit short-circuits. No DNS lookup, no MX check, no fingerprint match. If gmail.com is in the allowlist, gmail.com returns legitimate regardless of what any blocklist anywhere thinks. The same logic applies to harvard.edu.
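To make the short-circuit concrete, here is a runnable sketch of the same cascade, assuming in-memory stand-ins for the two tables and a stubbed resolver. The table contents, confidence values, and the resolverCalls counter are hypothetical, and the TLD safety net is left out to keep the example small.

```javascript
// Hypothetical in-memory stand-ins for the allowlist and disposable tables.
const allowlist = new Set(['gmail.com', 'harvard.edu']);
const disposable = new Map([
  ['mailinator.com', { verdict: 'disposable', confidence: 100, tier: 1 }],
  // Simulates a bad blocklist entry that slipped in at sync time.
  ['gmail.com', { verdict: 'disposable', confidence: 80, tier: 1 }],
]);

let resolverCalls = 0;

// Stub for the Tier 2-3 MX/IP lookup; counts calls so we can see when it runs.
function resolveAndClassify(domain) {
  resolverCalls += 1;
  return { verdict: 'unknown', confidence: 0, tier: 2 };
}

function classify(rawDomain) {
  const domain = rawDomain.trim().toLowerCase();
  // Tier 0: the allowlist is consulted first and ends the cascade on a hit.
  if (allowlist.has(domain)) {
    return { verdict: 'legitimate', confidence: 100, tier: 0 };
  }
  // Tier 1: direct hit in the disposable table.
  if (disposable.has(domain)) return disposable.get(domain);
  // Tier 2-3: only reached when neither table knows the domain.
  return resolveAndClassify(domain);
}
```

In this sketch, classify('gmail.com') returns the Tier 0 verdict even though the disposable map also contains gmail.com, and resolverCalls stays at zero because the resolver is never reached.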

This means a known-bad community blocklist that incorrectly includes gmail.com doesn’t damage us. The blocklist gets merged into the disposable table at sync time, but the Tier 0 override catches it. We log the overridden insertion as a flag-for-review (so the blocklist provider can be deprioritized if it keeps producing false positives) without ever returning a wrong verdict to a customer.
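The sync-time behaviour described above could be sketched roughly as follows. This assumes the blocklist arrives as a plain array of domain strings; mergeBlocklist, the record shapes, and the 'allowlist-override' reason label are all hypothetical names for illustration.

```javascript
// Merge one blocklist into the disposable table and collect review flags
// for any entry that the allowlist will override at classification time.
function mergeBlocklist(sourceName, domains, allowlist, disposable) {
  const flagged = [];
  for (const raw of domains) {
    const domain = raw.trim().toLowerCase();
    if (!domain) continue; // skip blank lines in the feed
    // The entry is still merged; Tier 0 overrides it when a verdict is computed.
    if (!disposable.has(domain)) {
      disposable.set(domain, { verdict: 'disposable', source: sourceName });
    }
    if (allowlist.has(domain)) {
      flagged.push({ domain, source: sourceName, reason: 'allowlist-override' });
    }
  }
  return flagged;
}
```

Counting flags per source over time is one way to decide which blocklist provider to deprioritize, without the flag itself ever changing a verdict.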

How a new domain gets onto the allowlist

Three entry paths:

  1. Curated seed. The initial 380 entries were curated from a combination of Gemini-generated candidate lists (with the gemini-legit source tag), manual review of top mail domains by global traffic share, and explicit additions for top employers, universities, and governments. Each entry is source-tagged for audit.

  2. Manual addition. Operations staff can add a domain explicitly via the admin UI. Typical reasons: a customer reports being blocked, we investigate, the domain turns out to be a small regional ISP that wasn’t in the seed. Add it with source: 'manual' and notes explaining the addition.

  3. Confidence promotion. In theory, a domain that consistently scores legitimate across multiple signal layers (high deliverability, healthy bounce rate, recognized DKIM signing, etc.) could auto-promote. This is a backstop we have not yet used in production: we’ve kept it off pending more conservative tuning, and for now every addition is human-reviewed.

The schema:

CREATE TABLE legit_mail_domains (
  domain        TEXT PRIMARY KEY,
  category      TEXT NOT NULL,  -- one of the 8 categories
  confidence    INTEGER NOT NULL,  -- 0-100
  source        TEXT NOT NULL,  -- gemini-legit | manual | seed
  notes         TEXT,
  created_at    INTEGER NOT NULL
);

The category field drives the verdict’s detection_source (e.g., legit-allowlist:webmail-public), which surfaces to operators auditing why a specific verdict landed.
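As a rough illustration of that mapping, the sketch below builds the detection_source string from a row's category. The category names come from the table earlier in this post; detectionSourceFor and the decision to throw on an unknown category are assumptions, not the production behaviour.

```javascript
// The 8 allowlist categories listed earlier in this post.
const CATEGORIES = new Set([
  'webmail-public', 'isp', 'corporate', 'education', 'government',
  'regional-webmail', 'privacy-mail-legit', 'hosting-default',
]);

// Build the detection_source string for an allowlist row,
// e.g. { category: 'isp' } -> 'legit-allowlist:isp'.
function detectionSourceFor(row) {
  if (!CATEGORIES.has(row.category)) {
    throw new Error(`unknown allowlist category: ${row.category}`);
  }
  return `legit-allowlist:${row.category}`;
}
```

Rejecting unknown categories at this point keeps a typo in a manually added row from surfacing as an unrecognizable detection_source later.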

What this prevents in practice

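Concrete failure modes the allowlist averts include the three from the opening of this post: a pattern match that flags harvard.edu, a country-TLD heuristic that flags gov.uk, and a bad community-blocklist entry that flags gmail.com.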

The cumulative effect: most of the foot-guns of running disposable detection are headed off at Tier 0.

What this looks like in your verify call

When /v1/check hits an allowlist entry, you get a clean legitimate verdict:

{
  "result": "deliverable",
  "reason": "valid_mailbox",
  "disposable": false,
  "spam_trap": false,
  "score": 1.0,
  "detection_source": "legit-allowlist:webmail-public"
}

The detection_source field tells you exactly why: the domain matched the webmail-public category in the allowlist. That is useful for auditing, or for more sophisticated handlers that want to know whether a deliverable verdict came from the safety net or from the domain simply not appearing in the disposable database.
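A handler that wants to make that distinction could do something like the following. This is a sketch that assumes the response has already been parsed into an object with the fields shown above; allowlistCategory is a hypothetical helper name.

```javascript
// Returns the allowlist category if the verdict came from Tier 0, otherwise null.
function allowlistCategory(response) {
  const prefix = 'legit-allowlist:';
  const source = response.detection_source;
  if (typeof source !== 'string' || !source.startsWith(prefix)) return null;
  return source.slice(prefix.length);
}

// Sample response shaped like the one above.
const sample = {
  result: 'deliverable',
  disposable: false,
  detection_source: 'legit-allowlist:webmail-public',
};

const category = allowlistCategory(sample); // 'webmail-public'
```

A null return means the verdict did not come from an explicit allowlist entry, so the caller can log or treat it differently if it cares about that distinction.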

Why this matters for production signup forms

Three operational consequences:

  1. You can rely on the verdict. A disposable or undeliverable response from vrfymail is unlikely to be a false positive on a legitimate consumer domain — the allowlist catches that class of error. You can act on the verdict without defensive client-side checking.

  2. Customer-support load drops. The single highest-volume customer-support ticket for any disposable-detection tool is “your service blocked my real address from a real domain.” The allowlist eliminates the common case (top consumer, ISP, edu, gov, corporate) of this ticket type.

  3. Regulatory-sensitive flows are workable. If you’re running a healthcare app where harvard.edu could be a researcher with a legitimate need, or a government-services portal where gov.uk is a real user, the allowlist makes the disposable detection safe to deploy. Without it, the false-positive cost dominates the false-negative cost and most teams just turn the detection off.

Connection to the broader operator graph

The allowlist is the Tier 0 safety net for the rest of the detection stack. The operator graph covers ~91% of disposable mail; the CT-log scanner keeps it fresh; the allowlist prevents the entire detection cascade from over-blocking. All three together produce the precision/recall envelope our disposable email checker API commits to.

FAQ

Why 380 entries and not a thousand?

The allowlist isn’t trying to enumerate every legitimate domain on the internet (impossible). It’s trying to cover the high-volume false-positive surface — the domains that would generate the most customer-support tickets if accidentally flagged. That set is small and concentrated.

Is the allowlist published?

Not as a downloadable list, but the verdict on any domain you query reflects allowlist membership (via detection_source: legit-allowlist:*). If you need a custom integration, the explicit categories are observable through the API responses.

What if I want to add my own corporate domain to be safe?

If you’re verifying email from your own employees’ corporate domain and want to make absolutely sure we never flag it, you can pre-cache the verdict on your side or report the domain for review. We add legitimate corporate domains on request, especially mid-market and large companies whose domains aren’t already in the corporate-category seed.
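One way to pre-cache on your side is a small local check that runs before the API call. The sketch below is an assumption about how you might structure that, not part of the vrfymail API: example-corp.com, the 'local-precache' label, and the checkRemote callback are all hypothetical placeholders.

```javascript
// Domains you own and never want flagged; hypothetical example value.
const TRUSTED_DOMAINS = new Set(['example-corp.com']);

// Returns a locally cached verdict for trusted domains, or null to defer to the API.
function localVerdict(email, trusted = TRUSTED_DOMAINS) {
  const at = email.lastIndexOf('@');
  if (at < 0) return null;
  const domain = email.slice(at + 1).trim().toLowerCase();
  return trusted.has(domain)
    ? { disposable: false, detection_source: 'local-precache' }
    : null;
}

// checkRemote is whatever function you already use to call the verification API.
function verify(email, checkRemote, trusted = TRUSTED_DOMAINS) {
  return localVerdict(email, trusted) ?? checkRemote(email);
}
```

Note that the local verdict only answers the disposable question; it skips the mailbox-level check, so it is best kept to domains you control.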

Does the allowlist apply to subdomains too?

The TLD safety net (edu/gov/mil/int) applies to subdomains. The explicit 380-domain list is apex-only by default: harvard.edu is allowlisted, while physics.harvard.edu falls under the .edu safety net rather than the explicit list. The effect is the same; the mechanism differs slightly.

What about ESPs that send on behalf of disposable services?

ESPs that send mail (Mailgun, Postmark, SendGrid, Resend) aren’t in the legitimate-mail allowlist — they’re in a different table for sending-infrastructure recognition. The allowlist applies to receiving-mail domains (gmail.com, harvard.edu, etc.), not to ESPs that handle outbound sends.

Verdict with the safety net included

vrfymail’s /v1/check runs Tier 0 on every verify: a 5µs allowlist check before any disposable claim is honored. Free tier covers 5,000 verifies/month, no card. Get an API key →