Research · v1.0 · CC-BY-4.0

We measured how often frontier LLMs hallucinate trademark availability. The numbers are below.

Six models. Five hundred candidate brand names. Three prompt variants. Five scored quantities — verdict accuracy, false-negative rate, citation hallucination, Brier score, expected calibration error. Ground truth is queried against live USPTO TSDR. The dataset, scoring code, and raw model responses are open.

Published 2026-05-19 · Modified 2026-05-20

Pending run completion

Numbers in the tables below land when the full benchmark run completes. Per our cardinal rule, this page displays only values we have actually measured. The methodology, paper draft, and code are public now.

Models: 6
Candidate names: 500
Prompt variants: 3
Total API calls: 9,000

TL;DR · six models, five metrics

The headline table. Every metric, every model.

Accuracy is overall verdict accuracy against ground truth. False-negative rate is the consequential failure mode — ground truth says risky, model says safe. Citation hallucination is the fraction of cited USPTO serials that do not resolve. Brier and ECE measure calibration on the v3 grounded prompt. Hedge rate on hard cases is the spontaneous-refusal rate; higher is better-calibrated.

Model	Vendor	Accuracy	FN rate	Citation hall.	Brier (v3)	ECE (v3)	Hedge (hard)
GPT-5	OpenAI	—	—	—	—	—	—
GPT-4.5	OpenAI	—	—	—	—	—	—
Claude 4.7 Opus	Anthropic	—	—	—	—	—	—
Claude 4.7 Sonnet	Anthropic	—	—	—	—	—	—
Gemini 3 Pro	Google DeepMind	—	—	—	—	—	—
Llama 4 405B	Meta	—	—	—	—	—	—

Each value is a measurement, not a claim. An unobserved value is rendered “—”, and any value that is exactly zero is paired with the “0 of N observed” confidence interval per the rule of three.
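As a rough illustration of the rule of three mentioned above: when 0 events are observed in N independent trials, the 95% upper confidence bound on the true rate is approximately 3/N. The sketch below compares that approximation with the exact bound 1 - 0.05^(1/N). The function names are hypothetical and are not taken from scoring.py.

```python
def rule_of_three_upper_bound(n: int) -> float:
    """Approximate 95% upper bound on a rate when 0 of n trials showed the event."""
    if n <= 0:
        raise ValueError("n must be a positive number of trials")
    return min(1.0, 3.0 / n)


def exact_zero_event_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact upper bound for 0 of n: solve (1 - p)^n = 1 - confidence for p."""
    if n <= 0:
        raise ValueError("n must be a positive number of trials")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)


if __name__ == "__main__":
    n = 500  # e.g. 0 fabricated citations observed across 500 names
    print(f"0 of {n} observed, rule of three: <= {rule_of_three_upper_bound(n):.4%}")
    print(f"0 of {n} observed, exact bound:   <= {exact_zero_event_upper_bound(n):.4%}")
```

For N = 500 the rule-of-three bound is 3/500 = 0.6%, so a displayed zero would read as "at most about 0.6% at 95% confidence" rather than as a true zero.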

Results

Per-model, per-surface. Confident-assertion hallucination.

Read the deep-dive (full table + interpretation) → · Download summary.json → · Dataset on GitHub →

Interactive comparison

Per-model accuracy at each prompt formulation. Naive, constrained, grounded.

Each axis carries three lines. The naive prompt is the way a real founder asks — no JSON, no escape hatch. The constrained prompt forces a binary verdict with a confidence score. The grounded prompt invites a cannot_verify response and asks for citations. The gap between v1 and v3 is the prompt-engineering value a verification-layer wrapper can deliver by default.

v1 Naive · v2 Constrained-JSON · v3 Grounded

Pending run completion

The per-model × per-prompt accuracy lines for the Trademark axis land here when the benchmark run completes. Per our cardinal rule, this chart renders only accuracy values we have actually measured against the v3 corpus.

The realistic founder workflow uses v1-style prompting. The improvement from v3 is real but unavailable to founders without prompt engineering. See PAPER.md §4.2.

Confidence calibration · Brier · ECE

When a model says “95% sure”, is it right 95% of the time?

Brier score is the mean squared error between verbalised confidence and observed correctness — lower is better. ECE is the weighted absolute gap between predicted and actual probabilities across ten confidence bins — lower is better. The overconfidence ratio is mean-confidence-on-incorrect divided by mean-confidence-on-correct; a well-calibrated model has this below 1.0. Reliability diagrams per model live in publication/confidence_calibration.png in the open repo.
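The three calibration quantities defined above can be sketched in a few lines. This is a minimal illustration, not the reference implementation in scoring.py: it assumes confidences have already been scaled from 0-100 to [0, 1], and it assumes ten equal-width bins for ECE (the text fixes the bin count but not the bin edges). All function names are hypothetical.

```python
from typing import Sequence


def brier_score(conf: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean squared error between stated confidence and observed correctness."""
    return sum((c - float(y)) ** 2 for c, y in zip(conf, correct)) / len(conf)


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted mean |mean confidence - accuracy| over equal-width bins (assumed)."""
    bins: list[list[tuple[float, float]]] = [[] for _ in range(n_bins)]
    for c, y in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, float(y)))
    total = len(conf)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins carry zero weight
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_conf - accuracy)
    return ece


def overconfidence_ratio(conf: Sequence[float], correct: Sequence[bool]) -> float:
    """Mean confidence on incorrect answers divided by mean confidence on correct ones."""
    right = [c for c, y in zip(conf, correct) if y]
    wrong = [c for c, y in zip(conf, correct) if not y]
    if not right or not wrong:
        raise ValueError("need at least one correct and one incorrect answer")
    return (sum(wrong) / len(wrong)) / (sum(right) / len(right))
```

A model that is equally confident whether it is right or wrong has an overconfidence ratio of exactly 1.0, which is why values below 1.0 are the target.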

Model	Brier	ECE	Mean conf. correct	Mean conf. incorrect
GPT-5	—	—	—	—
GPT-4.5	—	—	—	—
Claude 4.7 Opus	—	—	—	—
Claude 4.7 Sonnet	—	—	—	—
Gemini 3 Pro	—	—	—	—
Llama 4 405B	—	—	—	—

Calibration figures (reliability diagrams) are regenerated each quarterly run; the master copy lives in the open repo at publication/confidence_calibration.png.

Per-category breakdown · 10 categories × 6 models

Accuracy varies by category. Recent registrations are the worst.

AI agents and AI infrastructure produce the highest hallucination rates — categories with the most recent registry filings and the steepest training-data lag. Health and B2B SaaS produce the lowest rates because those registers are old, dense, and well-represented in training corpora. Consumer fintech produces the highest false-negative rates because the famous-mark distribution is steepest.

Pending run completion

The 10 categories × 6 models accuracy heatmap renders here when the benchmark run completes. Per our cardinal rule, cells land in this matrix only when the underlying number was actually measured against the ground-truth corpus.

Source: PAPER.md §4.3 per-category breakdown. The category mapping mirrors the test-set stratification per §3.2.1.

Citation hallucination examples

Confidently cited. Does not resolve.

The cleanest single artifact of the LLM trademark-clearance problem: a confident verdict supported by a USPTO serial number that does not exist in TSDR. Each row below is a real model response from the benchmark, with the fabricated citation and the ground-truth verdict. The full list is open at publication/top_hallucinations.md in the repo.

Pending run completion

The top fabricated USPTO serial numbers and TTAB decisions land here when the benchmark run completes. Per our cardinal rule, this list carries only citations we have actually validated against the live USPTO TSDR API.

Methodology summary

Test set, scoring, prompts. Open by default.

Test set

1,200 names · 10 categories · 6 frontier models

The corpus is stratified across ten product categories and four trap structures — phonetic-neighbor-of-famous-mark, dead-mark-lookalike, foreign-brand collision, and recent-micro-startup collision. Each name carries a ground-truth verdict derived from USPTO TSDR queries and dual expert review. Construction protocol in PAPER.md §3.2.

Three prompt variants

v1 naive · v2 constrained · v3 grounded

v1 Naive: Conversational; no JSON, no escape hatch. The realistic founder ask.
v2 Constrained-JSON: Structured JSON output with a forced binary verdict and 0-100 confidence.
v3 Grounded: Grounded with an explicit cannot_verify escape hatch and an evidence request.
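To make the difference between the structured variants concrete, here is a minimal parsing sketch. The JSON field names (verdict, confidence, citations) and the verdict labels safe and risky are assumptions for illustration; the actual prompt and response schemas are the ones documented in PAPER.md. Only cannot_verify and the 0-100 confidence scale come from the descriptions above.

```python
import json
from dataclasses import dataclass, field

# Assumed verdict vocabularies: v2 forces a binary verdict, v3 adds the escape hatch.
ALLOWED_VERDICTS = {
    "v2": {"safe", "risky"},
    "v3": {"safe", "risky", "cannot_verify"},
}


@dataclass
class ParsedResponse:
    verdict: str
    confidence: float  # rescaled from 0-100 to [0, 1]
    citations: list[str] = field(default_factory=list)


def parse_response(raw: str, variant: str) -> ParsedResponse:
    """Parse a structured (v2 or v3) model response under the assumed schema."""
    if variant not in ALLOWED_VERDICTS:
        raise ValueError(f"no structured schema for variant {variant!r}")
    data = json.loads(raw)
    verdict = data["verdict"]
    if verdict not in ALLOWED_VERDICTS[variant]:
        raise ValueError(f"verdict {verdict!r} not allowed under {variant}")
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 100.0:
        raise ValueError("confidence must be on the 0-100 scale")
    return ParsedResponse(verdict, confidence / 100.0, list(data.get("citations", [])))
```

Under this sketch a cannot_verify answer parses cleanly for v3 but is rejected for v2, which mirrors the forced binary verdict. The v1 naive prompt returns free text and would need a separate verdict-extraction step that is not shown here.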

Scoring rubric

Five quantities, deterministic scoring.

Verdict accuracy, false-negative rate, citation hallucination rate (validated against live USPTO TSDR at scoring time), Brier score, and ECE, plus the hedge rate on hard cases. Reference implementation in scoring.py; PAPER.md §3.5 documents the rubric in full.
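The non-calibration quantities in the rubric can be sketched as simple ratios. This is an illustration under stated assumptions, not the code in scoring.py: the record keys (truth, verdict, citations, hard) are hypothetical, a cannot_verify answer is assumed to count as not correct for accuracy, and the live TSDR lookup is replaced by an injected resolves callable so the sketch runs offline.

```python
from typing import Callable, Iterable, Optional


def score(records: Iterable[dict], resolves: Callable[[str], bool]) -> dict:
    """Compute accuracy, FN rate, citation hallucination rate, and hedge rate."""
    rows = list(records)

    def ratio(num: int, den: int) -> Optional[float]:
        # None marks an unobserved value (no denominator), not a measured zero.
        return num / den if den else None

    correct = sum(r["verdict"] == r["truth"] for r in rows)
    # False negative: ground truth says risky, model says safe.
    risky = [r for r in rows if r["truth"] == "risky"]
    false_negatives = sum(r["verdict"] == "safe" for r in risky)
    # Citation hallucination: cited serials that the resolver cannot find.
    cited = [s for r in rows for s in r.get("citations", [])]
    unresolved = sum(not resolves(s) for s in cited)
    # Hedge rate: cannot_verify answers among cases flagged as hard.
    hard = [r for r in rows if r.get("hard")]
    hedged = sum(r["verdict"] == "cannot_verify" for r in hard)

    return {
        "accuracy": ratio(correct, len(rows)),
        "fn_rate": ratio(false_negatives, len(risky)),
        "citation_hallucination": ratio(unresolved, len(cited)),
        "hedge_rate_hard": ratio(hedged, len(hard)),
    }
```

Returning None when a denominator is empty keeps "not measured" distinct from "measured as zero", in line with the page's rule that unobserved values are rendered as “—”.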

/methodology → · /coverage →

Honest limitations · PAPER.md §8 + Appendix D

What this benchmark does not measure.

D.1 Construct validity: A wrong-answer rate is not the same thing as operational uselessness; a model that hedges universally is uninteresting to a founder even when it never hallucinates.
D.2 Prompt leakage: Some candidate names appear elsewhere on the open web. We stratify a novel-name sub-sample and report its results separately.
D.5 Single-jurisdiction scope: US only. EUIPO and UKIPO are added in the 2026-Q3 release.
D.7 Etymolt's commercial interest: Etymolt operates in the verification-layer market; we disclose this interest. The dataset, scoring code, and raw responses are released so any third party can re-run the benchmark and report independent numbers. The benchmark is designed to be self-falsifying.
Etymolt-as-oracle circularity: For the handle, cultural, and sound axes the ground-truth oracle is Etymolt itself; this is the advisor §7.1-flagged failure mode. The current release scopes the circular axes as advisory and reports them separately from the trademark/domain measurements. Q3 2026 adds an independent oracle.

The full paper, full limitations Appendix D, and full ground-truth construction protocol live in the open repo at PAPER.md.

Historical overlay · engineering × customer-impact

Five quarters of Etymolt verdict accuracy. Trend context for the benchmark above.

Per our Bureau Model posture, the engineering metric (against a fixed corpus) and the customer-impact metric (90-day rolling survey) are published side by side. The first public quarter is 2026-Q2; the four prior quarters were internal-only and surface here for trend context. Per-quarter regression detail lives in /research/regressions.

Engineering accuracy · Customer-impact accuracy

Trademark quarterly accuracy values
Quarter	2025-Q2	2025-Q3	2025-Q4	2026-Q1	2026-Q2
Engineering	68.0%	71.0%	73.0%	76.0%	78.0%
Customer-impact	70.0%	73.0%	76.0%	79.0%	81.0%

Downloads

Paper, dataset, code, citation. CC-BY-4.0 / MIT.

Paper · arXiv

arXiv preprint

The full paper — methodology, prompts, scoring rubric, per-model breakdown, Appendix D limitations. Goes to arXiv within 48 hours of run completion.

Pending run completion

Dataset · Zenodo

Zenodo DOI

CC-BY-4.0 licensed. Raw model outputs + ground-truth labels + per-prompt scoring. DOI minted on completion.

Pending run completion

Code · GitHub

Benchmark harness

Anthropic, OpenAI, Google, and Meta API clients; scoring scripts; aggregation; the make run-full entry point. MIT licensed.

Open ↗

Citation · CITATION.cff

Cite the benchmark

Machine-readable citation file in the Citation File Format. Drops straight into a Zotero / Mendeley library.

Open ↗

Raw data · CSV

Per-prompt response CSVs

Bulk CSVs of the 129,600 model responses (975,192 finding-level cells) with per-row scoring. Importable into pandas, R, or a spreadsheet.

Pending run completion

Benchmark JSON

Aggregated summary

Latest run aggregate — per-model × per-axis × per-prompt accuracy, hallucination, and hedge rates in a single JSON.

Pending run completion

License · CC-BY-4.0 (paper + dataset) / MIT (code) · the BibTeX block lives at the bottom of PAPER.md.

Framing

Transparent. Audited. Falling toward zero.

We do not claim a verdict immune to error. We claim a benchmark whose every measurement is queryable, reproducible, and shrinking against a versioned ground truth — and whose residual error is logged, by name, in /research/regressions within the same publication cycle.

Read the verification methodology → · Back to etymolt.com →

── Bureau Model · verbatim ──

Etymolt is a screening signal, not legal advice. We are not a law firm and no attorney-client relationship is formed by your use of this service. Do not adopt a name in commerce without counsel review of your specific goods and services. Consult a licensed trademark attorney before adopting a name in commerce.

This page is research. It is not legal advice. Verdicts from any LLM, including those measured in this benchmark, are not a substitute for clearance work by a licensed trademark attorney. Etymolt's own clearance API surfaces a disclaimer field on every response; LLM hosts redistributing the verdict to an end user must surface that field verbatim. The benchmark exists to motivate the verification layer; the verification layer does not replace counsel.
