Operator fingerprint clustering disposable email
An operator running 14 publicly-different temp-mail brands has to make a choice. Either rebuild the analytics stack from scratch for each brand (separate AdSense publisher IDs, separate Google Analytics properties, separate Tag Manager containers), or paste the same IDs into all 14 sites and ship faster. In our experience, ~95% of multi-brand operators pick the second option. The result: the AdSense publisher ID 1407292178211259 appears in the HTML source of 5 separate temp-mail sites. The Google Analytics property G-SCROLLING appears across 13. Run those cross-references at scale and the “we have 945 operators in the registry” claim isn’t a guess — it’s a derived graph of distinct identifier clusters.
This post documents the methodology: which fingerprint types we extract, how we cluster, why this beats domain-name-based clustering, and how the resulting operator graph feeds the disposable email checker API.
The 8 fingerprint types we track
Pulled from production:
| Fingerprint type | Total IDs | Services with at least one |
|---|---|---|
| ga4 | 714 | 430 |
| gtm | 321 | 268 |
| adsense | 192 | 190 |
| gads_aw | 156 | 116 |
| ua (legacy Google Analytics) | 99 | 93 |
| fb_pixel | 88 | 69 |
| hotjar | 29 | 29 |
| yandex | 13 | 11 |
Read this as: of the disposable services we’ve probed, 430 of them have at least one Google Analytics 4 property ID in their HTML, and across those 430 services we’ve collected 714 distinct GA4 IDs. The ratio (714 IDs / 430 services = 1.66) reflects that some services have multiple GA4 IDs (testing properties, abandoned old ones, multi-domain properties). The reverse — multiple services sharing one GA4 ID — is the clustering signal.
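As a rough illustration of how the two columns relate, here is a minimal sketch that derives "Total IDs" and "Services with at least one" from stored rows. The row shape follows the (service_apex, fingerprint_type, fingerprint_value) layout described later in this post; the sample services and ID values are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows: (service_apex, fingerprint_type, fingerprint_value)
rows = [
    ("alpha-mail.example", "ga4", "G-AAAAAAAAA1"),
    ("alpha-mail.example", "ga4", "G-AAAAAAAAA2"),  # one service, two GA4 IDs
    ("beta-mail.example", "ga4", "G-AAAAAAAAA1"),   # shares an ID with alpha
    ("beta-mail.example", "adsense", "0000000000000001"),
]

def summarize(rows):
    """Return {fingerprint_type: (distinct_ids, distinct_services)}."""
    ids, services = defaultdict(set), defaultdict(set)
    for service, fp_type, fp_value in rows:
        ids[fp_type].add(fp_value)
        services[fp_type].add(service)
    return {t: (len(ids[t]), len(services[t])) for t in ids}

summary = summarize(rows)
for fp_type, (n_ids, n_services) in sorted(summary.items()):
    print(f"{fp_type}: {n_ids} IDs / {n_services} services = {n_ids / n_services:.2f}")
```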
The clustering signal
The relationship that produces operator clusters: one ID value, multiple distinct services.
Top hits from production:
| Fingerprint type | Value | Services sharing |
|---|---|---|
| ga4 | G-SCROLLING | 13 |
| ga4 | G-CSLL4ZEK4L | 7 |
| adsense | 1407292178211259 | 5 |
| ga4 | G-032TGHVY5H | 4 |
| ga4 | G-117NJTJLH9 | 4 |
| ga4 | G-2B34CLE4LG | 4 |
| ga4 | G-39YHTBT1N2 | 4 |
| ga4 | G-3FY38N72N5 | 4 |
| ga4 | G-43W7NQB96Y | 4 |
| ga4 | G-4VL8MM89TP | 4 |
The top entry, G-SCROLLING, is a particularly informative data point. It is not a randomly generated GA4 ID: Google Analytics assigns measurement IDs in the format G-[A-Z0-9]+ (typically 10 random characters), whereas G-SCROLLING is a human-written placeholder string. Thirteen disposable-mail services use it. The operator behind those 13 services almost certainly copied a “default” GA4 snippet template from a tutorial or another operator’s site, never replaced the placeholder, and shipped 13 brands with the same dummy tag.
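A placeholder like this can be flagged mechanically. The sketch below assumes that a well-formed GA4 ID is "G-" plus exactly 10 uppercase alphanumerics and treats anything else as suspicious. That is only a heuristic (a 10-letter English word would pass it), and the function name is ours for illustration.

```python
import re

# Assumption: a typical GA4 measurement ID is "G-" plus exactly 10
# uppercase alphanumerics. Anything else is treated as a possible placeholder.
GA4_TYPICAL = re.compile(r"G-[A-Z0-9]{10}")

def looks_like_placeholder(ga4_id: str) -> bool:
    """True when the ID does not have the typical GA4 shape."""
    return GA4_TYPICAL.fullmatch(ga4_id) is None

for candidate in ("G-SCROLLING", "G-CSLL4ZEK4L"):
    print(candidate, looks_like_placeholder(candidate))
```

G-SCROLLING has only nine characters after the prefix, so the length check alone is enough to flag it here.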
1407292178211259 is a real AdSense publisher ID. AdSense IDs are anchored to a specific Google account; 5 sites using the same ID means one Google account is monetizing 5 sites. That’s a single operator running 5 brands.
The cluster size distribution tails off quickly — most ID values appear on exactly 1 site. The interesting cases are the few-percent of IDs that appear on 4+ sites, which we use as cluster seeds.
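Finding those seeds amounts to inverting the stored rows: group services by (type, value) and keep the values seen on more than one service. A minimal sketch, with hypothetical service names and a hypothetical AdSense ID:

```python
from collections import defaultdict

# Hypothetical rows: (service_apex, fingerprint_type, fingerprint_value)
rows = [
    ("a.example", "ga4", "G-SCROLLING"),
    ("b.example", "ga4", "G-SCROLLING"),
    ("c.example", "ga4", "G-SCROLLING"),
    ("a.example", "adsense", "0000000000000001"),
    ("d.example", "adsense", "0000000000000001"),
    ("d.example", "ga4", "G-ONLYONE000"),  # appears on one site only
]

def shared_ids(rows, min_services=2):
    """Return (type, value, service_count) for IDs seen on several services."""
    index = defaultdict(set)
    for service, fp_type, fp_value in rows:
        index[(fp_type, fp_value)].add(service)
    hits = [(t, v, len(s)) for (t, v), s in index.items() if len(s) >= min_services]
    # Largest groups first, then by type and value for a stable order.
    return sorted(hits, key=lambda h: (-h[2], h[0], h[1]))

top_hits = shared_ids(rows)
print(top_hits)
```

Raising min_services to 4 would reproduce the cluster-seed threshold mentioned above.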
Why this beats domain-name-based clustering
Three alternative approaches and why fingerprinting wins:
- Cluster by similar domain names. tempmail1.com, tempmail2.com, tempmail3.com: same operator, right? Probably. But temp-mail.com, tempmail.org, tempmail.io are run by different operators (we’ve confirmed via separate fingerprint analysis). Name similarity is a weak signal that produces false positives.
- Cluster by registration metadata. Same registrant email in WHOIS, same name servers, same DNS provider. This used to work; modern registrars almost universally privacy-protect WHOIS by default, so the signal is gone for most domains.
- Cluster by IP address or hosting. Same /24, same Hetzner account, same ASN. Useful (see our bullet-proof hosting post) but blunt: many legitimate small businesses share the same /24 as a disposable-mail operator at Hetzner. False-positive risk is high.
Fingerprint clustering hits the sweet spot: high specificity (an AdSense publisher ID is unique to one Google account; one Google account is one operator), high coverage (~67% of disposable services we scrape have at least one extractable fingerprint), and resistant to the easy evasions (changing domain name doesn’t change the ID; switching hosting provider doesn’t change the ID; only deliberately rotating the analytics stack defeats the cluster).
The Playwright extraction pipeline
The pipeline runs as part of operator-discovery probes. For each candidate disposable apex:
- Launch headless Chrome via Playwright. Navigate to the apex.
- Wait for the page to settle (DOM-content-loaded, plus 2s for dynamic script loading).
- Walk the DOM and inline scripts. Regex-extract:
  - AdSense: ca-pub-[0-9]+ pattern
  - GA4: G-[A-Z0-9]+ in gtag() calls or <script> srcs
  - GTM: GTM-[A-Z0-9]+ in script srcs
  - UA: UA-[0-9]+-[0-9]+ in ga() or script srcs
  - Facebook Pixel: fbq('init', '...') numeric ID
  - Yandex Metrika: ym(<numeric_id>, 'init')
  - Hotjar: hjSettings = { hjid: <numeric> }
  - Google Ads (gads_aw): AW-[A-Z0-9-]+ pattern
- Hash/dedupe.
- Store as (service_apex, fingerprint_type, fingerprint_value) rows.
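The regex step can be sketched on its own, independent of the browser. The version below runs over an HTML string that has already been fetched, so the Playwright navigation and wait are left out. The exact expressions, the extract function, and the sample markup are illustrative assumptions based on the patterns listed above, not the production code.

```python
import re

# Illustrative patterns derived from the list above (assumed, not production).
PATTERNS = {
    "adsense": re.compile(r"ca-pub-([0-9]+)"),
    "ga4": re.compile(r"\b(G-[A-Z0-9]+)\b"),
    "gtm": re.compile(r"\b(GTM-[A-Z0-9]+)\b"),
    "ua": re.compile(r"\b(UA-[0-9]+-[0-9]+)\b"),
    "fb_pixel": re.compile(r"fbq\(\s*['\"]init['\"]\s*,\s*['\"]([0-9]+)['\"]"),
    "yandex": re.compile(r"ym\(\s*([0-9]+)\s*,\s*['\"]init['\"]"),
    "hotjar": re.compile(r"hjid\s*:\s*([0-9]+)"),
    "gads_aw": re.compile(r"\b(AW-[A-Z0-9-]+)"),
}

def extract(service_apex, html):
    """Return deduplicated, sorted (apex, type, value) rows for one page."""
    found = set()  # a set drops IDs that appear more than once in the page
    for fp_type, pattern in PATTERNS.items():
        for value in pattern.findall(html):
            found.add((service_apex, fp_type, value))
    return sorted(found)

# Hypothetical page source for a hypothetical service.
SAMPLE_HTML = """
<script async src="https://www.googletagmanager.com/gtag/js?id=G-SCROLLING"></script>
<script>gtag('config', 'G-SCROLLING'); gtag('config', 'AW-123456789');</script>
<script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-0000000000000001"></script>
<script>fbq('init', '1234567890');</script>
"""

extracted = extract("sample-mail.example", SAMPLE_HTML)
for row in extracted:
    print(row)
```

The AdSense value is stored without the ca-pub- prefix so that it matches the bare numeric form used in the tables above.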
The whole probe averages ~3 seconds per apex. We run it once per candidate at discovery time and periodically refresh the top-traffic operators (~quarterly) to catch ID changes.
What clustering actually produces
After enough cross-references accumulate, the operator-cluster algorithm runs. Pseudocode:
# Inputs (assumed shapes): `services` is an iterable of service apexes;
# `fingerprint_index` yields (fp_type, fp_value, services_with_id) tuples.

# Start: each service is its own cluster
clusters = {service: {service} for service in services}

# Merge clusters that share any fingerprint
for fp_type, fp_value, services_with_id in fingerprint_index:
    if len(services_with_id) > 1:
        merged = set()
        for service in services_with_id:
            merged |= clusters[service]
        # Point every member at the same merged set
        for service in merged:
            clusters[service] = merged

# Each resulting cluster = one operator
operators = set(frozenset(c) for c in clusters.values())
After running this against the ~600 unique services with extractable fingerprints, we get approximately 200 cluster groups + 400 singletons. The 200 cluster groups become operators with >1 brand; the 400 singletons become operators with exactly 1 brand. Combined with operators identified via MX-pattern clustering and customer reports, the total registry sits at the 945-entry mark.
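For readers who want something runnable, the same merge can be written with a union-find structure. This is a sketch under our own assumptions (toy rows, hypothetical service names, a helper called cluster), and it merges on any single shared ID, like the pseudocode above.

```python
from collections import defaultdict

def cluster(rows):
    """Group services that are linked by any shared (type, value) ID."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    by_id = defaultdict(list)
    for service, fp_type, fp_value in rows:
        find(service)  # register the service even if it shares nothing
        by_id[(fp_type, fp_value)].append(service)
    for services in by_id.values():
        for other in services[1:]:
            union(services[0], other)

    groups = defaultdict(set)
    for service in parent:
        groups[find(service)].add(service)
    return {frozenset(g) for g in groups.values()}

# Hypothetical rows: a and b share a GA4 ID, b and c share an AdSense ID.
rows = [
    ("a.example", "ga4", "G-AAAAAAAAAA"),
    ("b.example", "ga4", "G-AAAAAAAAAA"),
    ("b.example", "adsense", "0000000000000001"),
    ("c.example", "adsense", "0000000000000001"),
    ("d.example", "ga4", "G-DDDDDDDDDD"),
]

operators = cluster(rows)
multi_brand = [c for c in operators if len(c) > 1]
singletons = [c for c in operators if len(c) == 1]
print(len(multi_brand), "cluster group(s),", len(singletons), "singleton(s)")
```

Note that a and c end up in the same operator even though they share no ID directly; the link runs through b.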
Concrete example: tracing bccto.me’s 14 brands
bccto.me is the operator with the most distinct brands in our database (14). The cluster forms like this:
- Probe identifies bccto.me as a temp-mail service. Extracts fingerprints: 1 GA4 ID, 1 GTM ID, 1 AdSense ID.
- Probe identifies 13 other temp-mail-pattern apexes over subsequent weeks (various random names). Some return the same GA4 ID. Some return the same GTM ID. Some return the same AdSense ID. The intersection-by-shared-ID algorithm collapses all 14 into one cluster.
- The cluster gets canonicalized as operator bccto-me, display name bccto.me (+14 brands), brands list populated from the 14 apex domains.
- The cluster’s 289 mail domains (the actual inbox domains the 14 brands hand out to users) all get tied to operator_id = bccto-me in the disposable_mail_domains table.
From a new-domain-arrival perspective: when a 290th domain shows up from any of the 14 brands’ dropdowns, it inherits the operator linkage on first ingest. No re-probing required.
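The inheritance step is essentially a lookup at ingest time. A minimal sketch, assuming in-memory dictionaries in place of the real tables; the second brand, the new inbox domain, and the function name are hypothetical:

```python
# Brand apex -> operator ID. Only bccto.me / bccto-me come from this post;
# the second brand is a hypothetical stand-in for the other 13.
brand_operator = {
    "bccto.me": "bccto-me",
    "brand-two.example": "bccto-me",
}

# Stand-in for the disposable_mail_domains table: mail domain -> operator ID.
domain_operator = {}

def ingest_mail_domain(mail_domain, source_brand):
    """Link a newly seen inbox domain to the operator of the brand offering it."""
    operator_id = brand_operator.get(source_brand)
    if operator_id is not None:
        domain_operator[mail_domain] = operator_id
    return operator_id

linked = ingest_mail_domain("new-inbox.example", "brand-two.example")
print(linked)
```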
Why this matters for new-domain catch rate
The catch rate equation:
- Tier 1 catches: known domains. Static knowledge.
- Tier 2 catches: known MX patterns. Catches new domains under known operators (50ms).
- Tier 3 catches: known IP /24 ranges. Catches new domains under known operators by IP grouping.
- Tier 0 (fingerprint-driven): when we observe a new candidate at probe time, fingerprint matches collapse it into a known operator. Catches new operators with one existing relationship.
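One way to picture Tiers 1 to 3 is as a lookup cascade that stops at the first match. The sketch below is an assumption about ordering and data shapes: MX patterns are modelled as hostname suffixes, the sample domains and networks are hypothetical, and Tier 0 is absent because it runs at discovery time rather than at check time.

```python
import ipaddress

# Hypothetical knowledge tables, keyed to hypothetical operator IDs.
KNOWN_DOMAINS = {"known-temp.example": "op-a"}
KNOWN_MX_SUFFIXES = {".mx.op-a.example": "op-a"}  # assumed form of an "MX pattern"
KNOWN_NETS = {ipaddress.ip_network("203.0.113.0/24"): "op-b"}

def classify(domain, mx_hosts=(), mx_ips=()):
    """Return (tier, operator_id) for the first tier that matches, else None."""
    if domain in KNOWN_DOMAINS:
        return ("tier1", KNOWN_DOMAINS[domain])
    for host in mx_hosts:
        for suffix, operator_id in KNOWN_MX_SUFFIXES.items():
            if host.endswith(suffix):
                return ("tier2", operator_id)
    for ip in mx_ips:
        address = ipaddress.ip_address(ip)
        for network, operator_id in KNOWN_NETS.items():
            if address in network:
                return ("tier3", operator_id)
    return None

print(classify("fresh.example", mx_hosts=["in1.mx.op-a.example"]))
```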
The fingerprint layer is where new operator candidates get folded into existing clusters before they ever make it into the public detection table as a separate entry. That’s the discovery-time leverage: instead of cataloging 945 separate operators each with their own MX/IP profile, we maintain 200 cluster groups + 400 singletons. New domains under any of the 200 cluster groups get the cluster’s full MX/IP signature applied automatically.
What this looks like in your verify call
The fingerprint analysis is invisible to the API consumer. You get verdicts; you don’t see the operator graph behind them. A typical hit:
{
  "result": "undeliverable",
  "reason": "disposable",
  "reason_message": "This email provider doesn't deliver mail reliably. Please use a real address.",
  "disposable": true,
  "score": 0.0,
  "detection_source": "scraped-ui"
}
The detection_source: "scraped-ui" field indicates the domain was captured via the UI-scraping channel, which means it went through fingerprint extraction at discovery time and now sits under a clustered operator. Whether a verdict comes from an operator-cluster catch or a domain-blocklist hit, the response shape is identical. The operator-graph machinery is a back-end optimization, not a product surface.
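On the consumer side, handling that verdict can be as small as the sketch below. It parses the sample body shown above rather than calling the API, and the should_reject helper is our own illustration, not part of any client library.

```python
import json

# Response body copied from the sample above.
raw = """
{
  "result": "undeliverable",
  "reason": "disposable",
  "reason_message": "This email provider doesn't deliver mail reliably. Please use a real address.",
  "disposable": true,
  "score": 0.0,
  "detection_source": "scraped-ui"
}
"""

verdict = json.loads(raw)

def should_reject(verdict: dict) -> bool:
    """Reject a signup when the checker flags the address as disposable."""
    return verdict.get("disposable") is True

if should_reject(verdict):
    # Show the user-facing message; detection_source is useful for logging only.
    print(verdict["reason_message"])
```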
Limitations of fingerprint clustering
Three known limitations worth naming:
- Operators that scrub their stacks defeat the cluster. A sophisticated operator who rotates GA4 IDs per-brand and uses separate AdSense accounts per-brand looks like 10 independent operators in our table even if it’s one human running them. We catch them via other channels (CT-log, MX-pattern, customer-consensus) but the fingerprint advantage disappears.
- Legitimate co-tenant clustering risk. Two unrelated operators running on the same WordPress hosting service might inherit the same GTM ID from a managed-service template. We mitigate this by requiring 2+ fingerprint type overlaps before clustering (one shared GA4 isn’t enough; one shared GA4 + one shared AdSense ID is).
- Privacy-respecting operators don’t have fingerprints to extract. A temp-mail site running zero tracking has zero IDs in its HTML. These are caught by other signals (MX clustering, customer-consensus) but invisible to the fingerprint layer.
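The 2+ overlap rule from the second limitation can be sketched as a pairwise check: count how many distinct fingerprint types two services share and only treat the pair as mergeable at two or more. The shared-infrastructure exclusion set, the sample rows, and the function name are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical platform-managed ID that should never link two services.
SHARED_INFRA = {("gtm", "GTM-PLATFORM")}

def mergeable_pairs(rows, min_types=2):
    """Return service pairs that share IDs of at least `min_types` types."""
    by_id = defaultdict(set)
    for service, fp_type, fp_value in rows:
        if (fp_type, fp_value) not in SHARED_INFRA:
            by_id[(fp_type, fp_value)].add(service)

    shared_types = defaultdict(set)
    for (fp_type, _value), services in by_id.items():
        for a, b in combinations(sorted(services), 2):
            shared_types[(a, b)].add(fp_type)

    return sorted(pair for pair, types in shared_types.items() if len(types) >= min_types)

# Hypothetical rows: a and b share GA4 + AdSense; c shares only GA4.
rows = [
    ("a.example", "ga4", "G-XXXXXXXXXX"),
    ("b.example", "ga4", "G-XXXXXXXXXX"),
    ("c.example", "ga4", "G-XXXXXXXXXX"),
    ("a.example", "adsense", "0000000000000001"),
    ("b.example", "adsense", "0000000000000001"),
    ("c.example", "gtm", "GTM-PLATFORM"),
    ("d.example", "gtm", "GTM-PLATFORM"),
]

pairs = mergeable_pairs(rows)
print(pairs)
```

With min_types=1 the check falls back to merging on any single shared ID, which is the behaviour the co-tenant risk warns about.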
How the methodology connects to broader operator analysis
Fingerprint clustering produces the operator-to-brands edges. The operator graph itself (945 operators, 265K+ domains, top-15-by-count) is the macro view. Specific operator deep-dives like the Yopmail empire show the methodology applied to a single named cluster. The fingerprint layer is the engine; the operator graph and brand-attribution analyses are the output.
FAQ
Is this legal?
Reading IDs from publicly-served HTML is reading public data. We don’t bypass authentication, defeat tracking-prevention, or process personally-identifiable information — the IDs themselves identify operators (publishers, not visitors). The methodology is essentially what any web-developer Chrome inspector reveals manually, automated for scale.
Can operators evade this by rotating IDs?
Yes — and some sophisticated operators do. Rotating GA4 IDs per-brand defeats the GA4 cluster. Using separate AdSense accounts defeats the AdSense cluster. Using zero tracking defeats the whole approach. The trade-off for the operator: lose the analytics dashboard convenience to gain detection-resistance. Most operators choose convenience because they don’t think anyone’s clustering them.
What if multiple legitimate sites share an analytics property?
A managed-service template (like a WordPress.com free tier) might inject a shared GTM ID. We filter these out via the 2+ overlap rule and by tagging known platform-managed IDs as “shared-infrastructure” with no clustering effect. The operator-cluster table only forms when multiple unique IDs cross-reference.
How fresh is the fingerprint data?
Initial probe at discovery (typically within 24-48h of first encountering an operator). Refresh probes for top-traffic operators run quarterly. Long-tail operators stay at their original fingerprint snapshot until something triggers a re-probe.
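That refresh policy reduces to a small predicate. In the sketch below, "quarterly" is approximated as 90 days and the trigger flag stands in for whatever event prompts a long-tail re-probe; both are our assumptions.

```python
from datetime import date, timedelta

REFRESH_INTERVAL = timedelta(days=90)  # assumption: "quarterly" is about 90 days

def needs_reprobe(last_probed, is_top_traffic, today, triggered=False):
    """Decide whether a service's fingerprint snapshot should be refreshed."""
    if triggered:
        return True  # an external trigger always forces a re-probe
    return is_top_traffic and (today - last_probed) >= REFRESH_INTERVAL

print(needs_reprobe(date(2024, 1, 1), True, date(2024, 4, 15)))
```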
Can I get the operator-cluster mapping for my own analysis?
The verdict from /v1/check reflects operator-cluster membership but doesn’t surface the cluster identifier publicly. For internal investigation of a specific operator’s brand footprint, we can share findings on request — investigative work is a service for higher-tier customers.
The cluster runs on every verify
vrfymail’s /v1/check uses the operator-cluster graph for every Tier-1 hit. 50ms p50. Free tier: 5,000 verifies/month, no card.