Operator fingerprint clustering disposable email


An operator running 14 publicly distinct temp-mail brands has to make a choice: either rebuild the analytics stack from scratch for each brand (separate AdSense publisher IDs, separate Google Analytics properties, separate Tag Manager containers), or paste the same IDs into all 14 sites and ship faster. In our experience, ~95% of multi-brand operators pick the second option. The result: the AdSense publisher ID 1407292178211259 appears in the HTML source of 5 separate temp-mail sites. The Google Analytics property G-SCROLLING appears across 13. Run those cross-references at scale and the “we have 945 operators in the registry” claim isn’t a guess; it’s a derived graph of distinct identifier clusters.

This post documents the methodology: which fingerprint types we extract, how we cluster, why this beats domain-name-based clustering, and how the resulting operator graph feeds the disposable email checker API.

The 8 fingerprint types we track

Pulled from production:

Fingerprint type              Total IDs   Services with at least one
ga4                           714         430
gtm                           321         268
adsense                       192         190
gads_aw                       156         116
ua (legacy Google Analytics)  99          93
fb_pixel                      88          69
hotjar                        29          29
yandex                        13          11

Read this as: of the disposable services we’ve probed, 430 of them have at least one Google Analytics 4 property ID in their HTML, and across those 430 services we’ve collected 714 distinct GA4 IDs. The ratio (714 IDs / 430 services = 1.66) reflects that some services have multiple GA4 IDs (testing properties, abandoned old ones, multi-domain properties). The reverse — multiple services sharing one GA4 ID — is the clustering signal.
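Both table columns can be derived from the raw observations. Here is a minimal sketch, assuming each observation is stored as a (service, fingerprint type, fingerprint value) tuple. The helper name summarize and the rows are made up for illustration and are not production data.

```python
from collections import defaultdict

def summarize(rows):
    """Return {fp_type: (distinct_ids, services_with_at_least_one)}."""
    ids = defaultdict(set)       # fp_type -> distinct ID values
    services = defaultdict(set)  # fp_type -> services carrying at least one ID
    for service, fp_type, fp_value in rows:
        ids[fp_type].add(fp_value)
        services[fp_type].add(service)
    return {t: (len(ids[t]), len(services[t])) for t in ids}

# Made-up rows for illustration only.
rows = [
    ("a.example", "ga4", "G-AAAAAAAAA1"),
    ("a.example", "ga4", "G-AAAAAAAAA2"),  # one service, two GA4 IDs
    ("b.example", "ga4", "G-BBBBBBBBB1"),
    ("b.example", "adsense", "0000000000000001"),
]
stats = summarize(rows)
print(stats["ga4"])  # (3, 2): three distinct IDs across two services
```

A ratio above 1.0 in the first column over the second means some services carry more than one ID of that type.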

The clustering signal

The relationship that produces operator clusters: one ID value, multiple distinct services.

Top hits from production:

Fingerprint type   Value              Services sharing
ga4                G-SCROLLING        13
ga4                G-CSLL4ZEK4L       7
adsense            1407292178211259   5
ga4                G-032TGHVY5H       4
ga4                G-117NJTJLH9       4
ga4                G-2B34CLE4LG       4
ga4                G-39YHTBT1N2       4
ga4                G-3FY38N72N5       4
ga4                G-43W7NQB96Y       4
ga4                G-4VL8MM89TP       4

The top entry, G-SCROLLING, is a particularly informative data point. It is not a randomly generated GA4 ID: Google Analytics auto-generates measurement IDs in the format G-[A-Z0-9]+ (typically 10 random characters), whereas G-SCROLLING is a human-written placeholder string. 13 disposable-mail services use it. The operator behind those 13 services almost certainly copied a “default” GA4 snippet template from a tutorial or another operator’s site, never replaced the placeholder, and shipped 13 brands with the same dummy tag.
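A quick way to flag placeholder-looking GA4 IDs is to test them against the expected shape. The sketch below assumes auto-generated IDs are "G-" followed by exactly 10 characters from A-Z and 0-9. This is a heuristic based on the "typically 10" observation, not a documented guarantee, and looks_autogenerated is a hypothetical helper name.

```python
import re

# Assumption: auto-generated GA4 IDs are "G-" plus exactly 10 chars from A-Z0-9.
AUTO_GA4 = re.compile(r"G-[A-Z0-9]{10}")

def looks_autogenerated(ga4_id: str) -> bool:
    """True when the ID matches the assumed auto-generated shape."""
    return AUTO_GA4.fullmatch(ga4_id) is not None

for candidate in ("G-CSLL4ZEK4L", "G-SCROLLING"):
    print(candidate, looks_autogenerated(candidate))
```

G-SCROLLING has only 9 characters after the prefix, so it fails the check, while the other IDs in the table pass.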

1407292178211259 is a real AdSense publisher ID. AdSense IDs are anchored to a specific Google account; 5 sites using the same ID means one Google account is monetizing 5 sites. That’s a single operator running 5 brands.

The cluster size distribution tails off quickly: most ID values appear on exactly 1 site. The interesting cases are the few percent of IDs that appear on 4+ sites, which we use as cluster seeds.
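Seed selection amounts to inverting the observations into an ID-to-services index and keeping the IDs above the threshold. This is a sketch under the assumption that observations are (service, fingerprint type, fingerprint value) tuples. The function name cluster_seeds and the sample rows are hypothetical.

```python
from collections import defaultdict

def cluster_seeds(rows, min_services=4):
    """Map each (fp_type, fp_value) to its services; keep IDs seen on min_services+ sites."""
    index = defaultdict(set)
    for service, fp_type, fp_value in rows:
        index[(fp_type, fp_value)].add(service)
    return {key: svcs for key, svcs in index.items() if len(svcs) >= min_services}

# Made-up rows: one ID shared by four services, one ID seen on a single service.
rows = [(f"brand{i}.example", "ga4", "G-SHARED0001") for i in range(4)]
rows.append(("solo.example", "ga4", "G-UNIQUE0001"))

seeds = cluster_seeds(rows)
print(sorted(seeds))  # only the ID shared by four services remains
```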

Why this beats domain-name-based clustering

Three alternative approaches and why fingerprinting wins:

  1. Cluster by similar domain names. tempmail1.com, tempmail2.com, tempmail3.com — same operator, right? Probably. But temp-mail.com, tempmail.org, tempmail.io are run by different operators (we’ve confirmed via separate fingerprint analysis). Name similarity is a weak signal that produces false positives.

  2. Cluster by registration metadata. Same registrant email in WHOIS, same name servers, same DNS provider. This used to work; modern registrars almost universally privacy-protect WHOIS by default, so the signal is gone for most domains.

  3. Cluster by IP address or hosting. Same /24, same Hetzner account, same ASN. Useful (see our bullet-proof hosting post) but blunt — many legitimate small businesses share the same /24 as a disposable-mail operator at Hetzner. False-positive risk is high.

Fingerprint clustering hits the sweet spot: high specificity (an AdSense publisher ID is unique to one Google account; one Google account is one operator), high coverage (~67% of disposable services we scrape have at least one extractable fingerprint), and resistant to the easy evasions (changing domain name doesn’t change the ID; switching hosting provider doesn’t change the ID; only deliberately rotating the analytics stack defeats the cluster).

The Playwright extraction pipeline

The pipeline runs as part of operator-discovery probes. For each candidate disposable apex:

  1. Launch headless Chrome via Playwright. Navigate to the apex.
  2. Wait for the page to settle (DOMContentLoaded, plus 2s for dynamic script loading).
  3. Walk the DOM and inline scripts. Regex-extract:
     - AdSense: ca-pub-[0-9]+ pattern
     - GA4: G-[A-Z0-9]+ in gtag() calls or <script> srcs
     - GTM: GTM-[A-Z0-9]+ in script srcs
     - UA: UA-[0-9]+-[0-9]+ in ga() or script srcs
     - Facebook Pixel: fbq('init', '...') numeric ID
     - Yandex Metrika: ym(<numeric_id>, 'init')
     - Hotjar: hjSettings = { hjid: <numeric> }
     - Google Ads (gads_aw): AW-[A-Z0-9-]+ pattern
  4. Hash/dedupe.
  5. Store as (service_apex, fingerprint_type, fingerprint_value) rows.
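Steps 3 to 5 can be sketched as a pure function over the page HTML. The patterns follow the list above, but the exact capture groups, quoting rules, and word boundaries are assumptions for this sketch, and the sample HTML and IDs are placeholders.

```python
import re

# Patterns follow the list above; capture details are assumptions for this sketch.
PATTERNS = {
    "adsense": re.compile(r"ca-pub-([0-9]+)"),
    "ga4": re.compile(r"\b(G-[A-Z0-9]+)\b"),
    "gtm": re.compile(r"\b(GTM-[A-Z0-9]+)\b"),
    "ua": re.compile(r"\b(UA-[0-9]+-[0-9]+)\b"),
    "fb_pixel": re.compile(r"fbq\(\s*['\"]init['\"]\s*,\s*['\"]([0-9]+)['\"]"),
    "yandex": re.compile(r"ym\(\s*([0-9]+)\s*,\s*['\"]init['\"]"),
    "hotjar": re.compile(r"hjid\s*:\s*([0-9]+)"),
    "gads_aw": re.compile(r"\b(AW-[A-Z0-9-]+)"),
}

def extract_fingerprints(service_apex, html):
    """Return deduplicated (service_apex, fingerprint_type, fingerprint_value) rows."""
    rows = set()  # the set handles the dedupe step
    for fp_type, pattern in PATTERNS.items():
        for value in pattern.findall(html):
            rows.add((service_apex, fp_type, value))
    return sorted(rows)

# Placeholder HTML with a GA4 ID (appearing twice) and an AdSense ID.
sample_html = """
<script async src="https://www.googletagmanager.com/gtag/js?id=G-ABCDE12345"></script>
<script>gtag('config', 'G-ABCDE12345');</script>
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-0000000000000000"></script>
"""
for row in extract_fingerprints("example.test", sample_html):
    print(row)
```

The GA4 ID appears twice in the sample but yields a single row, which is the dedupe behavior step 4 calls for.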

The whole probe averages ~3 seconds per apex. We run it once per candidate at discovery time and periodically refresh the top-traffic operators (~quarterly) to catch ID changes.
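The page-load part of the probe (launch, navigate, settle) might look roughly like this with Playwright's Python sync API. The function name fetch_rendered_html, the https:// scheme, and the settle_ms default are assumptions. The import is done lazily so the snippet loads even where Playwright is not installed.

```python
def fetch_rendered_html(apex: str, settle_ms: int = 2000) -> str:
    """Load https://<apex> in headless Chromium and return the settled HTML."""
    # Third-party dependency; imported lazily so this module loads without it.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Return from goto() once the DOMContentLoaded event has fired.
            page.goto(f"https://{apex}", wait_until="domcontentloaded")
            # Fixed pause so dynamically loaded scripts have time to inject tags.
            page.wait_for_timeout(settle_ms)
            return page.content()
        finally:
            browser.close()
```

The returned HTML string is what a regex extraction step would then scan for IDs.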

What clustering actually produces

After enough cross-references accumulate, the operator-cluster algorithm runs. Pseudocode:

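def cluster_operators(services, fingerprint_index):
    """Group services into operators by shared fingerprint IDs.

    services: iterable of service apexes.
    fingerprint_index: iterable of (fp_type, fp_value, services_with_id) tuples.
    """
    # Start: each service is its own cluster
    clusters = {service: {service} for service in services}

    # Merge clusters that share any fingerprint
    # (the 2+ overlap filter described under Limitations is not shown here)
    for fp_type, fp_value, services_with_id in fingerprint_index:
        if len(services_with_id) > 1:
            merged = set()
            for service in services_with_id:
                merged |= clusters[service]
            # Point every member at the same merged set
            for service in merged:
                clusters[service] = merged

    # Each resulting cluster = one operator
    return {frozenset(c) for c in clusters.values()}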

After running this against the ~600 unique services with extractable fingerprints, we get approximately 200 cluster groups + 400 singletons. The 200 cluster groups become operators with >1 brand; the 400 singletons become operators with exactly 1 brand. Combined with operators identified via MX-pattern clustering and customer reports, the total registry sits at the 945-entry mark.
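The merge is a connected-components computation, so a disjoint-set (union-find) structure is a common alternative that avoids rewriting every cluster member on each merge. The sketch below swaps in union-find for illustration and is not claimed to be the production implementation. The service names and IDs are made up.

```python
def find(parent, x):
    """Follow parent links to the root, halving the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster(services, fingerprint_index):
    """Return a list of frozensets, one per operator."""
    parent = {s: s for s in services}
    for _fp_type, _fp_value, sharing in fingerprint_index:
        sharing = list(sharing)
        for other in sharing[1:]:
            root_a, root_b = find(parent, sharing[0]), find(parent, other)
            if root_a != root_b:
                parent[root_b] = root_a  # union the two components
    groups = {}
    for s in services:
        groups.setdefault(find(parent, s), set()).add(s)
    return [frozenset(g) for g in groups.values()]

services = ["a.example", "b.example", "c.example", "d.example"]
index = [
    ("ga4", "G-SHARED0001", {"a.example", "b.example"}),
    ("adsense", "0000000000000001", {"b.example", "c.example"}),
]
operators = cluster(services, index)
multi = [g for g in operators if len(g) > 1]
single = [g for g in operators if len(g) == 1]
print(len(multi), len(single))  # 1 cluster group (a, b, c) and 1 singleton (d)
```

Note that a and c never share an ID directly; they end up in the same cluster because each shares one with b.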

Concrete example: tracing bccto.me’s 14 brands

bccto.me is the operator with the most distinct brands in our database (14). The cluster forms like this:

  1. Probe identifies bccto.me as a temp-mail service. Extracts fingerprints: 1 GA4 ID, 1 GTM ID, 1 AdSense ID.
  2. Probe identifies 13 other temp-mail-pattern apexes over subsequent weeks (various random names). Some return the same GA4 ID. Some return the same GTM ID. Some return the same AdSense ID. The merge-by-shared-ID algorithm collapses all 14 into one cluster.
  3. The cluster gets canonicalized as operator bccto-me, display name bccto.me (+14 brands), brands list populated from the 14 apex domains.
  4. The cluster’s 289 mail domains (the actual inbox domains the 14 brands hand out to users) all get tied to operator_id = bccto-me in the disposable_mail_domains table.

From a new-domain-arrival perspective: when a 290th domain shows up from any of the 14 brands’ dropdowns, it inherits the operator linkage on first ingest. No re-probing required.
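The inheritance step can be pictured as a lookup from brand to operator at ingest time. This is only a sketch: the dictionaries stand in for the brand list and the disposable_mail_domains table, and the function name and the new domain are hypothetical.

```python
# Hypothetical in-memory stand-ins for the brand list and the
# disposable_mail_domains table; the new domain name is a placeholder.
brand_to_operator = {"bccto.me": "bccto-me"}
mail_domains = {}  # mail domain -> operator_id

def ingest_mail_domain(mail_domain, seen_on_brand):
    """Record a mail domain, inheriting the operator of the brand that offered it."""
    operator_id = brand_to_operator.get(seen_on_brand)
    if operator_id is not None:
        mail_domains[mail_domain] = operator_id
    return mail_domains.get(mail_domain)  # None when no operator is known yet

print(ingest_mail_domain("new-inbox.example", "bccto.me"))  # bccto-me
```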

Why this matters for new-domain catch rate

The catch rate equation:

The fingerprint layer is where new operator candidates get folded into existing clusters before they ever make it into the public detection table as a separate entry. That’s the discovery-time leverage: instead of cataloging 945 separate operators each with their own MX/IP profile, we maintain 200 cluster groups + 400 singletons. New domains under any of the 200 cluster groups get the cluster’s full MX/IP signature applied automatically.

What this looks like in your verify call

The fingerprint analysis is invisible to the API consumer. You get verdicts; you don’t see the operator graph behind them. A typical hit:

{
  "result": "undeliverable",
  "reason": "disposable",
  "reason_message": "This email provider doesn't deliver mail reliably. Please use a real address.",
  "disposable": true,
  "score": 0.0,
  "detection_source": "scraped-ui"
}

The detection_source: "scraped-ui" indicates the domain was captured via the UI-scraping channel, which means it went through fingerprint extraction at discovery time and now sits under a clustered operator. Whether a verdict comes from an operator-cluster catch or a domain-blocklist hit, the response shape is identical. The operator-graph machinery is a back-end optimization, not a product surface.
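On the consumer side, handling the verdict only needs the fields shown in the sample response. This minimal sketch parses that sample body; the should_reject helper and the choice to key the decision on the disposable flag are assumptions, not documented client behavior.

```python
import json

# The sample response body from above.
body = """{
  "result": "undeliverable",
  "reason": "disposable",
  "reason_message": "This email provider doesn't deliver mail reliably. Please use a real address.",
  "disposable": true,
  "score": 0.0,
  "detection_source": "scraped-ui"
}"""

def should_reject(verdict: dict) -> bool:
    """Reject when the verdict flags the address as disposable."""
    return bool(verdict.get("disposable"))

verdict = json.loads(body)
if should_reject(verdict):
    print(verdict["reason_message"])
```

The code never needs to know which detection channel produced the hit, which matches the point that the verdict shape is the same either way.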

Limitations of fingerprint clustering

Three known limitations worth naming:

  1. Operators that scrub their stacks defeat the cluster. A sophisticated operator who rotates GA4 IDs per-brand and uses separate AdSense accounts per-brand looks like 10 independent operators in our table even if it’s one human running them. We catch them via other channels (CT-log, MX-pattern, customer-consensus) but the fingerprint advantage disappears.

  2. Legitimate co-tenant clustering risk. Two unrelated operators running on the same WordPress hosting service might inherit the same GTM ID from a managed-service template. We mitigate this by requiring 2+ fingerprint type overlaps before clustering (one shared GA4 isn’t enough; one shared GA4 + one shared AdSense ID is).

  3. Privacy-respecting operators don’t have fingerprints to extract. A temp-mail site running zero tracking has zero IDs in its HTML. These are caught by other signals (MX clustering, customer-consensus) but invisible to the fingerprint layer.
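The 2+ overlap rule from the second limitation can be sketched as a pairwise check: two services are linkable only if they share IDs of at least two different fingerprint types. Treating the rule as pairwise is an assumption on our part, and linkable_pairs and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def linkable_pairs(rows, min_types=2):
    """Return service pairs that share IDs of at least min_types fingerprint types."""
    by_id = defaultdict(set)  # (fp_type, fp_value) -> services carrying it
    for service, fp_type, fp_value in rows:
        by_id[(fp_type, fp_value)].add(service)

    shared_types = defaultdict(set)  # (service_a, service_b) -> shared fp types
    for (fp_type, _value), services in by_id.items():
        for pair in combinations(sorted(services), 2):
            shared_types[pair].add(fp_type)
    return {pair for pair, types in shared_types.items() if len(types) >= min_types}

# Made-up rows: a and b share a GA4 ID and an AdSense ID; c shares only the GA4 ID.
rows = [
    ("a.example", "ga4", "G-SHARED0001"),
    ("b.example", "ga4", "G-SHARED0001"),
    ("a.example", "adsense", "0000000000000001"),
    ("b.example", "adsense", "0000000000000001"),
    ("c.example", "ga4", "G-SHARED0001"),
]
print(linkable_pairs(rows))  # {('a.example', 'b.example')}
```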

How the methodology connects to broader operator analysis

Fingerprint clustering produces the operator-to-brands edges. The operator graph itself (945 operators, 265K+ domains, top-15-by-count) is the macro view. Specific operator deep-dives like the Yopmail empire show the methodology applied to a single named cluster. The fingerprint layer is the engine; the operator graph and brand-attribution analyses are the output.

FAQ

Is this legal?

Reading IDs from publicly-served HTML is reading public data. We don’t bypass authentication, defeat tracking-prevention, or process personally-identifiable information; the IDs themselves identify operators (publishers, not visitors). The methodology is essentially what any web developer can see by hand in the Chrome DevTools inspector, automated for scale.

Can operators evade this by rotating IDs?

Yes — and some sophisticated operators do. Rotating GA4 IDs per-brand defeats the GA4 cluster. Using separate AdSense accounts defeats the AdSense cluster. Using zero tracking defeats the whole approach. The trade-off for the operator: lose the analytics dashboard convenience to gain detection-resistance. Most operators choose convenience because they don’t think anyone’s clustering them.

What if multiple legitimate sites share an analytics property?

A managed-service template (like a WordPress.com free tier) might inject a shared GTM ID. We filter these out via the 2+ overlap rule and by tagging known platform-managed IDs as “shared-infrastructure” with no clustering effect. The operator-cluster table only forms when multiple unique IDs cross-reference.
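The shared-infrastructure tag can be pictured as a deny-list applied before clustering. In this sketch, the registry contents and the GTM value are placeholders, and clustering_rows is a hypothetical helper.

```python
# Hypothetical registry of platform-managed IDs; the value is a placeholder.
SHARED_INFRASTRUCTURE = {("gtm", "GTM-PLATFORM")}

def clustering_rows(rows):
    """Drop rows whose ID is tagged shared-infrastructure, so it cannot link services."""
    return [r for r in rows if (r[1], r[2]) not in SHARED_INFRASTRUCTURE]

rows = [
    ("a.example", "gtm", "GTM-PLATFORM"),
    ("b.example", "gtm", "GTM-PLATFORM"),
    ("a.example", "ga4", "G-OWNED00001"),
]
print(clustering_rows(rows))  # only the ga4 row survives
```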

How fresh is the fingerprint data?

Initial probe at discovery (typically within 24-48h of first encountering an operator). Refresh probes for top-traffic operators run quarterly. Long-tail operators stay at their original fingerprint snapshot until something triggers a re-probe.
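The refresh policy reduces to a small decision function. This sketch takes "quarterly" to mean 90 days; that interval and the needs_reprobe signature are assumptions.

```python
from datetime import datetime, timedelta, timezone

REFRESH_INTERVAL = timedelta(days=90)  # assumption: "quarterly" taken as 90 days

def needs_reprobe(last_probed, is_top_traffic, triggered=False, now=None):
    """Top-traffic operators refresh on a schedule; long-tail ones only on a trigger."""
    now = now or datetime.now(timezone.utc)
    if triggered:
        return True
    return is_top_traffic and now - last_probed >= REFRESH_INTERVAL

now = datetime(2025, 1, 1, tzinfo=timezone.utc)
old = now - timedelta(days=120)
print(needs_reprobe(old, is_top_traffic=True, now=now))   # True
print(needs_reprobe(old, is_top_traffic=False, now=now))  # False
```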

Can I get the operator-cluster mapping for my own analysis?

The verdict from /v1/check reflects operator-cluster membership but doesn’t surface the cluster identifier publicly. For internal investigation of a specific operator’s brand footprint, we can share findings on request — investigative work is a service for higher-tier customers.

The cluster runs on every verify

vrfymail’s /v1/check uses the operator-cluster graph for every Tier-1 hit. 50ms p50. Free tier: 5,000 verifies/month, no card. Get an API key →