The full table. Every model. Every axis.
The headline numbers are on the parent landing page. This page carries the full per-model × per-axis table, the per-prompt-version slices, the per-trap-type breakdown, and the interpretation. The data renders live from summary.json, so a fresh re-run shows up without a redeploy.
Interpretation
What the patterns mean. Architecture, not vendor.
Frontier models are not safe as the final clearance check.
Not because the models are bad. Because the question is structurally outside their training data. The trademark register, the domain registration data served over RDAP, and the live handle registries keep accumulating records after the training cutoff. Parametric memory is a snapshot taken before the registration you might be conflicting with was filed.
The error is asymmetric.
Models over-bless. RLHF rewards confident helpful responses; saying “ship it” is more rewarding than saying “I don't know.” The false-negative rate — the rate at which the model says safe and the truth is risky — is the dangerous direction, and it is materially higher than the false-positive rate.
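To make the asymmetry concrete, here is a minimal sketch of how the two error rates could be computed from scored trials. The `Trial` type and the choice of denominators (risky cases for the false-negative rate, safe cases for the false-positive rate) are assumptions made for illustration, not the benchmark's documented scoring code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    # Hypothetical record shape, not the benchmark's actual schema.
    model_said_safe: bool  # True when the model's verdict was "safe to ship"
    truly_safe: bool       # True when ground truth shows no conflict


def error_rates(trials):
    """Return (false_negative_rate, false_positive_rate).

    False negative: the model says safe and the truth is risky.
    False positive: the model says risky and the truth is safe.
    Each rate is taken over the cases where that error is possible
    (an assumption; the source does not state the denominator).
    """
    risky = [t for t in trials if not t.truly_safe]
    safe = [t for t in trials if t.truly_safe]
    false_negatives = sum(t.model_said_safe for t in risky)
    false_positives = sum(not t.model_said_safe for t in safe)
    fnr = false_negatives / len(risky) if risky else 0.0
    fpr = false_positives / len(safe) if safe else 0.0
    return fnr, fpr
```

An over-blessing model shows up as a false-negative rate well above its false-positive rate on the same trial set.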
Models fabricate citations.
When asked to support their verdict with a USPTO serial number, they invent serial numbers. The numbers do not resolve to any USPTO record. They look right — eight digits, correct format. They are fiction. Same failure mode as the Mata v. Avianca lawyers in 2023, ported to the trademark domain.
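The gap between "looks right" and "resolves" can be sketched as two separate checks. This is an illustration only: `known_serials` is a hypothetical stand-in for a real lookup against the register, and the serial values a caller passes in would be placeholders, not real filings.

```python
import re

# Eight digits, per the format described above.
SERIAL_FORMAT = re.compile(r"\d{8}")


def looks_like_serial(serial: str) -> bool:
    """Format check only: says nothing about whether a record exists."""
    return SERIAL_FORMAT.fullmatch(serial) is not None


def citation_is_supported(serial: str, known_serials: set) -> bool:
    """A citation counts only if it is well formed AND resolves.

    `known_serials` is a hypothetical stand-in for querying the
    register; a real check would look the serial up at the source.
    """
    return looks_like_serial(serial) and serial in known_serials
```

A fabricated serial passes `looks_like_serial` and fails `citation_is_supported`, which is why a format check alone cannot catch this failure mode.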
Recent registrations are invisible.
Any mark filed within the model's training-data lag window (typically the last 6 to 18 months) is essentially invisible. The model will confidently bless a name that conflicts with a registration carrying the full legal weight of a 30-year-old Coca-Cola registration. Being recent does not make the registration less binding; it just makes it invisible.
The hedge rate is the diagnostic.
A well-calibrated model on this task has a low hedge rate on easy cases (it should know GoogIe with a capital-i is risky) and a high hedge rate on hard cases (it cannot know about a startup that filed last quarter). Models with flat hedge rates across difficulty are uncalibrated.
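One way to read the diagnostic is to compute the hedge rate per difficulty bucket and look at the gap between them. The sketch below assumes each response has already been labeled with a difficulty ("easy" or "hard") and a hedged flag; both labels and the function names are hypothetical, not fields taken from summary.json.

```python
from collections import defaultdict


def hedge_rate_by_difficulty(responses):
    """Map each difficulty label to the fraction of responses that hedged.

    `responses` is an iterable of (difficulty, hedged) pairs, where
    `hedged` is True when the model declined to assert a verdict.
    """
    totals = defaultdict(int)
    hedges = defaultdict(int)
    for difficulty, hedged in responses:
        totals[difficulty] += 1
        hedges[difficulty] += int(hedged)
    return {d: hedges[d] / totals[d] for d in totals}


def hedge_gap(rates, easy="easy", hard="hard"):
    """Hard-case hedge rate minus easy-case hedge rate.

    A gap near zero is the flat profile described above.
    """
    return rates[hard] - rates[easy]
```

A calibrated model yields a clearly positive gap; a model that hedges at the same rate on look-alike traps and last-quarter filings yields a gap near zero.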
Why a validation layer exists.
Every LLM suggests names that turn out to be trademarked, taken, culturally offensive, or linguistically broken. The benchmark measures how often. Etymolt is the validation API that catches it before the name ships: five surfaces, sub-2-second response, signed verdicts. The benchmark is the receipt for the category.
Methodology
How to read these numbers.
“Hallucinated” means the model confidently asserted a claim (“.com is available,” “@handle is free,” “no trademark conflict”) that was false against ground truth. Hedges (“I cannot verify”) and unparseable responses are not counted as hallucinations because the model never asserted anything.
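The scoring rule above can be sketched as a small classifier. The `Outcome` names, the `kind` labels, and the choice to divide by all responses when computing a rate are assumptions for illustration; the source only states that hedges and unparseable responses are not counted as hallucinations.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    CORRECT = "correct"
    HALLUCINATED = "hallucinated"
    HEDGED = "hedged"
    UNPARSEABLE = "unparseable"


def score(kind: str, asserted: Optional[bool], truth: bool) -> Outcome:
    """Classify one response against ground truth.

    kind: "asserted", "hedged", or "unparseable" (hypothetical labels).
    asserted: the claim the model made, e.g. True for ".com is available";
              ignored unless kind == "asserted".
    truth: the ground-truth value of that same claim.
    """
    if kind == "unparseable":
        return Outcome.UNPARSEABLE
    if kind == "hedged":
        return Outcome.HEDGED
    if kind != "asserted":
        raise ValueError(f"unknown response kind: {kind!r}")
    # Only a confident assertion that contradicts ground truth counts.
    return Outcome.CORRECT if asserted == truth else Outcome.HALLUCINATED


def hallucination_rate(outcomes) -> float:
    """Share of all responses scored as hallucinated (assumed denominator)."""
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    return sum(o is Outcome.HALLUCINATED for o in outcomes) / len(outcomes)
```

Under this sketch a hedge can never raise the hallucination count, which matches the rule that a model that asserted nothing cannot have asserted something false.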
Trademark and domain are scored against independent registries (USPTO TSDR + RDAP). Handle, cultural, and sound use Etymolt as oracle because no public single-shot dataset exists for those axes. Those numbers measure internal consistency, not validation against an outside source. The split is documented in the press kit body copy.