
Mentions, citations, and absorption: three different things, three different metrics

Mention, citation, and absorption are three distinct outcomes, and only absorption causally shapes what an AI tells a customer. Most tracking tools report citation, the middle metric. ChatGPT cites fewer sources per answer than Perplexity, but each cited ChatGPT source carries more influence.

Mention, citation, and absorption are three different outcomes, and most AI visibility tools measure only the middle one. A mention is a brand name appearing in the answer text. A citation is an explicit attribution to a source URL. Absorption is how much a cited source actually shaped the answer's wording, evidence, and structure. Only absorption causally changes what the AI tells a customer. A measurement framework that reports one number per engine, or aggregates across engines, gives a brand team a metric that looks precise and decides little. This article uses all three terms as defined here.

Mention, citation, and absorption are three measurable outcomes, not three names for one

A mention is the floor. The brand or product name appears in the answer text, with no source link required. A citation sits above it: the AI attributes a specific claim to a source URL the reader can click. Absorption is the ceiling. It is the degree to which a cited page shaped the answer's language and evidence, rather than sitting in a reference list as a courtesy. The three are measured separately, and the published frameworks treat them as distinct constructs [Zhang et al. · arXiv (cs.IR) · 2026].

The distinction is not academic hair-splitting. The Princeton GEO study tested the content edits that raise visibility: adding citations, statistics, and quotations. Those edits lifted AI visibility by up to 41% on average [Aggarwal et al. · Princeton University / Georgia Tech / Allen Institute for AI / IIT Delhi · 2024]. That uplift is an absorption effect, not a citation-count effect, because the study measured visibility as the share of the AI answer attributable to the optimised page. The GEO-16 empirical analysis reaches the same construct from a different angle. It scored pages on sixteen quality pillars and found that high-scoring pages were cited at materially higher rates [arXiv · 2025].

Treating the three as one collapses the signal. A brand can be mentioned without being cited, cited without being absorbed, and in rare cases absorbed without a formal citation. Each gap means something different for the work a brand team should do next.

Citation breadth is the wrong primary metric

A page can be cited but not absorbed. It appears in the source list while the answer is built from somewhere else. Citation breadth, the count of sources an engine lists, is the metric most tools report, and it sits in the middle of the three outcomes. It tells a brand team less than the dashboard implies.

The attribution literature has documented the gap directly. A study of source coverage found systematic bias in which pages LLM-based search engines surface, distinct from the pages that shape the answer [arXiv · 2025]. A separate analysis estimated how much value AI search extracts from cited publishers relative to the traffic it returns. The citation count overstates the credit a source actually receives [Strauss et al. · O'Reilly Media · 2025]. Attribution bias also varies with which model generated the answer, so the same page earns different treatment across engines [Abolghasemi et al. · University of Amsterdam et al. · 2025].

The practical consequence is that a rising citation count can mask a falling influence. A brand that optimises for breadth can be listed more often and absorbed less, and a breadth-only dashboard will report progress.

Engine matters, and per-engine absorption matters more than cross-engine breadth

Being cited by ChatGPT is worth more than being cited by Perplexity, and the reason is absorption, not prestige. The citation-absorption framework found Perplexity cites more sources per answer while each cited page carries lower influence. ChatGPT cites fewer sources, and each cited page carries higher influence [Zhang et al. · arXiv (cs.IR) · 2026]. A framework that aggregates across engines, or counts citations without weighting influence, gets this backwards.

The scale of the citation behaviour underneath confirms the engines differ in kind, not just degree. An analysis of 680 million tracked citations found Wikipedia is ChatGPT's single largest source, at 7.8% of citations [Profound (TryProfound) · 2025]. Among ChatGPT's ten most-cited sources, Wikipedia alone held a 47.9% share. A source that dominates one engine's top-10 share is not interchangeable with one scattered thinly across another engine's longer list.

The takeaway for measurement is concrete. A brand cited five times in Perplexity and twice in ChatGPT may be more visible to customers through the two ChatGPT citations. The right comparison is influence-weighted and per-engine, not a summed citation count.
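
The arithmetic behind that comparison can be sketched in a few lines. This is an illustration only: the `mean_influence` weights below are hypothetical placeholders chosen to show the calculation, not figures from the cited framework, and multiplying count by mean influence is one simple weighting among several a team could choose.

```python
from dataclasses import dataclass


@dataclass
class EngineCitations:
    engine: str
    citation_count: int    # times the brand's pages were cited
    mean_influence: float  # hypothetical 0..1 absorption per citation


def influence_weighted_score(row: EngineCitations) -> float:
    """Citations scaled by how much each one shaped the answer."""
    return row.citation_count * row.mean_influence


# Hypothetical numbers, chosen only to illustrate the arithmetic.
rows = [
    EngineCitations("Perplexity", citation_count=5, mean_influence=0.10),
    EngineCitations("ChatGPT", citation_count=2, mean_influence=0.40),
]

# Scores stay per engine; they are compared, never summed into one figure.
scores = {row.engine: influence_weighted_score(row) for row in rows}
for engine, score in scores.items():
    print(f"{engine}: {score:.2f}")
```

Under these assumed weights, the two ChatGPT citations outweigh the five Perplexity citations, which a summed citation count would hide.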

Absorption is measurable today, and the methodology is published

Absorption is not a vague aspiration. Four methods measure it, in increasing order of rigour. Lexical overlap compares the answer text to the source using metrics like ROUGE and BLEU. Semantic similarity compares sentence embeddings for meaning rather than exact wording. Claim-level attribution breaks the answer into atomic claims and matches each to its supporting evidence [arXiv · 2025]. Counterfactual analysis re-runs the query with one source removed and measures how the answer changes.
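
The first method is simple enough to sketch with the standard library. The function below is a minimal ROUGE-1-style unigram overlap, not the reference ROUGE implementation: it has no stemming or stopword handling, and the tokenizer and the example sentences are assumptions made for illustration.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase alphanumeric tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_overlap(answer: str, source: str) -> dict[str, float]:
    """ROUGE-1-style clipped unigram overlap between an answer and a source."""
    a, s = Counter(tokens(answer)), Counter(tokens(source))
    overlap = sum((a & s).values())  # clipped count of shared tokens
    precision = overlap / max(sum(a.values()), 1)  # share of answer tokens found in source
    recall = overlap / max(sum(s.values()), 1)     # share of source tokens reused in answer
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical answer and source text.
result = unigram_overlap(
    "The label links each product to the web.",
    "Each product links to the web through a digital label.",
)
print(result)
```

For absorption, the precision figure is the more relevant direction, since it asks how much of the answer's wording can be found in the source. Lexical overlap misses paraphrase, which is why the semantic and claim-level methods sit above it.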

The first three methods are routinely applicable. Fine-grained citation generation in long-context question answering is now documented, which makes claim-level matching practical at the answer level [Zhang · THUDM · 2025]. The fourth method, leave-one-out counterfactual testing, is the closest thing to direct causal measurement. It is expensive and impractical at scale, because it multiplies the number of queries by the number of sources.
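
A minimal sketch of the leave-one-out procedure and its cost follows. The `generate` callable is a hypothetical wrapper around whatever engine is being tested, `toy_generate` is a stand-in so the example runs, and character-level `difflib` similarity is a swapped-in distance chosen for brevity, not a method taken from the cited papers.

```python
from difflib import SequenceMatcher
from typing import Callable, Sequence

# Hypothetical signature: (query, sources) -> answer text.
Generate = Callable[[str, Sequence[str]], str]


def leave_one_out(query: str, sources: Sequence[str], generate: Generate) -> list[float]:
    """Answer shift (0 = unchanged, 1 = fully different) when each source is removed."""
    baseline = generate(query, sources)
    shifts = []
    for i in range(len(sources)):
        reduced = [s for j, s in enumerate(sources) if j != i]
        ablated = generate(query, reduced)
        shifts.append(1.0 - SequenceMatcher(None, baseline, ablated).ratio())
    return shifts


def call_budget(n_queries: int, sources_per_query: int) -> int:
    """One baseline call plus one ablated call per source, for every query."""
    return n_queries * (sources_per_query + 1)


# Toy stand-in for an engine: it simply concatenates the sources it was given.
def toy_generate(query: str, sources: Sequence[str]) -> str:
    return " ".join(sources)


shifts = leave_one_out("q", ["short.", "a much longer source passage."], toy_generate)
print(shifts)
print(call_budget(n_queries=602, sources_per_query=7))
```

With 602 prompts and roughly 7 cited sources per answer, a single pass already needs 4,816 engine calls, and because answers are stochastic each call would in practice have to be repeated several times.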

A measurement caution sits inside the method choice itself. Correctness and faithfulness are not the same property. An answer can state a correct fact while attributing it to a source that did not supply it [Wallat · SIGIR/ICTIR 2025 · 2025]. A sound absorption measure has to separate whether the claim is true from whether the cited source shaped it. Citation-failure analysis names the specific ways attribution breaks and offers efficient mitigation, which gives the measurement a checklist of failure modes to test against [arXiv · 2025].

Absorption is the right metric, and it has limits that ship with it

Absorption is correlation between source and answer, not direct observation of model internals. The measurement infers influence from textual overlap and counterfactual change; it does not read the model's weights. That limit is structural and does not disappear with a better metric.

Three further limits matter for any brand team reading an absorption number. The first is training-data contamination, which can inflate scores for older, widely quoted facts: the model may reproduce a fact it learned in training while a live source coincidentally matches it. The second is stochasticity, which affects absorption measurement the way it affects every LLM evaluation. The same query run twice can produce different answers, so a single measurement is a sample, not a constant. Quantifying that inconsistency with intraclass correlation shows how much a one-shot reading can mislead [Mustahsan · arXiv · 2025]. A method-of-moments approach reports an evaluation score with a confidence interval rather than a point estimate, and the same treatment can be applied to absorption [Lior et al. · arXiv (cs.CL) / EMNLP 2025 · 2025].
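
Treating a reading as a sample can be sketched as follows. This is a plain standard-error interval under a normal approximation, not the method-of-moments estimator from the cited paper, and the five scores are hypothetical.

```python
import math
import statistics


def mean_with_interval(samples: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Mean and normal-approximation 95% interval from repeated runs."""
    if len(samples) < 2:
        raise ValueError("need at least two runs to estimate variance")
    mean = statistics.fmean(samples)
    stderr = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, mean - z * stderr, mean + z * stderr


# Hypothetical absorption scores from five runs of the same query.
runs = [0.42, 0.35, 0.51, 0.38, 0.44]
mean, low, high = mean_with_interval(runs)
print(f"absorption = {mean:.2f} (95% CI {low:.2f} to {high:.2f}, n={len(runs)})")
```

With this few runs, a t-distribution multiplier would give a wider and more honest interval than z = 1.96. The point of the sketch is only that the reported figure carries its spread and its run count.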

The third limit is method dependence: different methods produce different absolute numbers, so the methodology has to be disclosed alongside the figure. Retrieval evaluation itself needs rethinking in the LLM era, because the older information-retrieval metrics were built for ranked document lists, not generated answers [arXiv · 2025]. An absorption score without its method and its variance is a number without a unit.

The right dashboard reports three numbers per engine, not one number cross-engine

A measurement strategy that follows the evidence tracks all three outcomes and reports each separately. Mention rate is the floor: how often the brand appears at all. Citation rate is the middle: how often a claim is attributed to the brand's own source. Absorption is the ceiling: how much the brand's content shaped the answer when it was cited. Never aggregate the three into a single visibility score, because the aggregation hides which layer is failing.
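
One way to keep the three numbers apart is to compute them from per-run records and never combine them. The sketch below assumes a hypothetical `Run` record per query execution; the field names and the sample data are illustrative, not a schema from any cited tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    engine: str
    mentioned: bool              # brand name appears in the answer text
    cited: bool                  # answer attributes a claim to the brand's URL
    absorption: Optional[float]  # 0..1 score, only defined when cited


def per_engine_report(runs: list[Run]) -> dict[str, dict[str, Optional[float]]]:
    """Mention rate, citation rate, and mean absorption, reported per engine."""
    grouped: dict[str, list[Run]] = defaultdict(list)
    for run in runs:
        grouped[run.engine].append(run)

    report: dict[str, dict[str, Optional[float]]] = {}
    for engine, rs in grouped.items():
        scores = [r.absorption for r in rs if r.cited and r.absorption is not None]
        report[engine] = {
            "mention_rate": sum(r.mentioned for r in rs) / len(rs),
            "citation_rate": sum(r.cited for r in rs) / len(rs),
            # Absorption is conditional on citation; None when never cited.
            "mean_absorption": sum(scores) / len(scores) if scores else None,
        }
    return report


# Hypothetical run log.
sample = [
    Run("ChatGPT", mentioned=True, cited=True, absorption=0.4),
    Run("ChatGPT", mentioned=True, cited=False, absorption=None),
    Run("Perplexity", mentioned=False, cited=False, absorption=None),
    Run("Perplexity", mentioned=True, cited=True, absorption=0.1),
]
report = per_engine_report(sample)
for engine, row in report.items():
    print(engine, row)
```

The report deliberately has no total row and no blended score, so a weak layer on one engine stays visible.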

The same discipline applies across engines. Visibility in AI search should be measured more than once and per platform [Schulte et al. · University of St. Gallen · 2026]. A single cross-engine number averages away the engine-level differences that decide customer exposure. A realistic evaluation environment lets a team test how content changes move the metrics before committing to a programme [arXiv · 2026].

The reading rule is simple. If a dashboard reports one number, it is almost certainly citation, and citation is the middle. Mentions are the floor and absorption is the ceiling, and a brand team that cannot see all three cannot tell which one to work on.

Write for absorption, optimise for retrieval

Absorption and citation are correlated, but they are driven by different levers, which makes them two optimisation problems rather than one. Absorption is driven by what is inside the answer: definitions, statistics, comparisons, and evidence density. The Princeton GEO finding that citations, statistics, and quotations raise visibility by up to 41% is a statement about content, not markup [Aggarwal et al. · Princeton University / Georgia Tech / Allen Institute for AI / IIT Delhi · 2024].

Citation is driven by retrieval signals: machine readability, structured data, freshness, and authority. A structural feature engineering study held the semantic content constant and varied only document architecture, information chunking, and visual emphasis. That restructuring lifted citation rates by 17.3% on average across six AI search engines [arXiv · 2026]. The semantic content did not change; only the structure did.

The two levers pull on different metrics. Evidence density raises the chance a cited page is absorbed once retrieved. Structure and retrieval signals raise the chance the page is retrieved and cited in the first place. A programme aimed at only one lever moves only one metric. The gap between the two numbers is itself diagnostic.
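
Reading the gap can be reduced to a small decision rule. The thresholds below are hypothetical defaults picked for illustration; a team would set them from its own baselines, per engine.

```python
from typing import Optional


def diagnose(
    citation_rate: float,
    mean_absorption: Optional[float],
    citation_floor: float = 0.2,    # hypothetical threshold
    absorption_floor: float = 0.2,  # hypothetical threshold
) -> str:
    """Point to the lever that the weaker of the two metrics implies."""
    if citation_rate < citation_floor:
        # The page is rarely retrieved and cited at all.
        return "retrieval: work on structure, structured data, freshness"
    if mean_absorption is None or mean_absorption < absorption_floor:
        # The page is cited, but the answer is built from elsewhere.
        return "content: add definitions, statistics, comparisons, evidence"
    return "both levers working: keep monitoring per engine"


print(diagnose(citation_rate=0.05, mean_absorption=None))
print(diagnose(citation_rate=0.60, mean_absorption=0.08))
print(diagnose(citation_rate=0.60, mean_absorption=0.45))
```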


Sources

Sources are tiered per our methodology & sources page.

Key finding

ChatGPT cites around 7 sources per answer; Perplexity and Google AI Overviews cite more. But pages cited by ChatGPT have a much higher average influence on the answer's wording and evidence. Influence rises with page length, structure, and the density of definitions, statistics, comparisons, and step-by-step procedures.

Methodology note

602 controlled prompts run through ChatGPT, Google AI Overview / Gemini, and Perplexity. The researchers analysed 21,143 citations and 18,151 fetched pages, extracting 72 features per citation. They measured citation breadth (how many sources are cited) and citation depth (how much each cited source actually shapes the final answer). The dataset is public.

arXiv

Tier A: Strongest evidence

Don't Measure Once: Measuring Visibility in AI Search (GEO)

University of St. Gallen · Schulte et al. · 2026

Key finding

Argues that single-snapshot AI visibility measurement understates true brand presence in generative search. Proposes a longitudinal measurement framework that captures variation across runs, prompts, and platforms, demonstrating that any one-time snapshot of citation rate or mention rate can swing materially across repeated queries. Stochasticity itself is a measurement parameter, not noise to discard.

Methodology note

arXiv preprint 2604.07585 (April 2026). Position paper proposing a multi-run, multi-prompt evaluation protocol for GEO. Direct fetch on arxiv.org returned the canonical abstract page; PDF body was inaccessible but methodology summary was confirmed through the abstract and the linked DOI.

arXiv
Key finding

A structural-engineering framework called GEO-SFE separates content structure into three layers: document architecture, information chunking and visual emphasis. Applied to the same underlying text, the framework lifts citation rates in generative engines by 17.3% on average and subjective answer quality by 18.5% across six mainstream AI search engines. The semantic content itself is preserved; only structure changes.

Methodology note

arXiv paper 2603.29979 by Yu, Yang, Ding and Sato, submitted March 2026. The authors define structural features at macro, meso and micro levels and build predictive models for citation probability that are tuned per engine. They evaluate the framework against six generative engines and report consistent gains in citation rate and quality across configurations.

arXiv
Key finding

SAGEO Arena introduces a realistic environment for evaluating search-augmented generative engine optimization, simulating the full pipeline from query through retrieval to answer generation. Empirical tests across published GEO methods show that arena-based evaluation reveals failures that simpler benchmarks miss, particularly under realistic source-distribution drift and adversarial competition.

Methodology note

arXiv preprint 2602.12187 (February 2026). Direct fetch on arxiv.org returned the HTML preprint with the full methodology and arena specification; the released benchmark covers multiple search engines and GEO method variants.

arXiv
Key finding

Empirically compares source coverage and citation bias between LLM-based search engines and traditional search. Finds that LLM-based search systematically over-represents large, English-language, US-based sources and under-represents smaller and non-English content compared with what traditional search returns for the same queries. Bias is consistent across the major LLM search providers tested.

Methodology note

arXiv preprint 2512.09483 (December 2025). Direct fetch on arxiv.org returned the abstract page. The paper runs matched queries across LLM-based and traditional search systems and quantifies citation distribution by source size, language, and geography.

arXiv
Key finding

Quantifies stochasticity in agentic LLM evaluations using intraclass correlation coefficients (ICC). Shows that single-run evaluations of agentic systems are unreliable because run-to-run variance is large relative to the gap between system variants. Recommends a minimum of 5 to 10 repeated runs per evaluation and reports the ICCs for several common agentic benchmarks.

Methodology note

arXiv preprint 2512.06710 (December 2025). Direct fetch on arxiv.org returned the abstract page. The paper applies the intraclass correlation coefficient framework from psychometrics to LLM agent evaluation and reports ICC values across multiple published benchmarks.

arXiv

Tier A: Strongest evidence

Redefining Retrieval Evaluation in the Era of LLMs

arXiv · 2025

Key finding

Argues that traditional retrieval evaluation metrics (recall, MRR) underestimate the value of retrieval in LLM-based pipelines because LLMs can compensate for partial retrieval through their pre-existing knowledge. Proposes new metrics that measure retrieval value conditional on the LLM's downstream behaviour, finding that some 'high-recall' retrievers are actually worse for LLM-based search.

Methodology note

arXiv preprint 2510.21440 (October 2025). Direct fetch returned the abstract page. The paper introduces conditional retrieval metrics evaluated against standard RAG benchmarks and shows that metric choice changes the relative ranking of common retrieval methods.

arXiv
Key finding

Defines citation failure as a measurable phenomenon in RAG systems where retrieved documents are not cited even when they support the answer. Introduces CITECONTROL, a method to detect and mitigate citation failure that improves citation recall without degrading answer quality. The method is lightweight and integrates with standard RAG pipelines.

Methodology note

arXiv preprint 2510.20303 (October 2025). Direct fetch returned the abstract page. The paper introduces a formal definition of citation failure and an empirical benchmark across multiple RAG systems, with CITECONTROL's improvements measured on standard QA datasets.

arXiv
Key finding

Proposes sub-sentence-level citations in RAG outputs, where each cited passage is matched to a specific sub-sentence in the generated answer rather than to the answer as a whole. The approach improves attribution precision and reduces over-citation, where models cite a source for an entire sentence even when only part of it is supported.

Methodology note

arXiv preprint 2509.20859 (September 2025). Direct fetch on arxiv.org returned the abstract page; method details and benchmark scores are in the full PDF. Empirical evaluation against standard RAG attribution baselines.

arXiv
Key finding

Three on-page properties showed the strongest association with whether a page got cited by AI answer engines: metadata and freshness, semantic HTML markup, and structured data. Pages that scored at least 0.70 on the GEO-16 quality score and met at least 12 of 16 quality pillars were cited at substantially higher rates than pages that did not.

Methodology note

70 product-intent prompts were run across Brave Summary, Google AI Overviews, and Perplexity, producing 1,702 citations across 1,100 unique URLs. The researchers audited each cited page against a 16-pillar framework and used logistic models with domain-clustered standard errors. The study focuses on English-language B2B SaaS pages. Published September 2025.

arXiv

Tier A: Strongest evidence

The Attribution Crisis in LLM Search Results: Estimating Ecosystem Exploitation

O'Reilly Media · Strauss et al. · 2025

Key finding

Argues that the citation behaviour of LLM-based search constitutes an attribution crisis: cited sources are systematically under-credited (fewer click-throughs than equivalent SERP positions), over-extracted (more content reproduced verbatim or near-verbatim), and concentrated on a small subset of high-authority publishers. Quantifies the ecosystem-level economic impact on publishers.

Methodology note

arXiv preprint 2508.00838 (August 2025). Direct fetch on arxiv.org returned the abstract page. The paper combines empirical citation analysis with economic modelling to estimate ecosystem-level effects on publisher revenue and proposes attribution reforms.

arXiv

Tier A: Strongest evidence

Evaluation of Attribution Bias in Generator-Aware Retrieval-Augmented LLMs

University of Amsterdam et al. · Abolghasemi et al. · 2025

Key finding

Identifies attribution bias in generator-aware RAG: when an LLM is told which documents it can cite, the model preferentially attributes claims to documents that align with its own pre-existing beliefs and ignores sources that contradict its output. The bias is measurable and persists across model families.

Methodology note

arXiv preprint 2410.12380 (October 2024). Direct fetch returned the abstract page. The paper develops a controlled experimental setup and runs it across several LLM families; the bias is reported as statistically significant on standard QA tasks.

ACL 2025 Findings

Tier A: Strongest evidence

Correctness is not Faithfulness in RAG Attributions

SIGIR/ICTIR 2025 · Wallat · 2025

Key finding

Shows that a RAG system's answers can be factually correct while its citations are unfaithful, meaning the cited passages do not actually support the generated claim. Across standard benchmarks, correctness and faithfulness diverge measurably, implying that citation-quality evaluation must be a separate metric from answer accuracy in any AI visibility tracking system.

Methodology note

arXiv preprint 2412.18004 (December 2024). Empirical study testing whether RAG answers and their cited evidence are mutually consistent. Direct fetch on arxiv.org confirmed the abstract; the full evaluation uses public attribution datasets and human annotation for faithfulness scoring.

arXiv
Key finding

LongCite enables LLMs to generate fine-grained citations in long-context QA by training the model to attribute each statement in its answer to a specific span in the retrieved long document. The method substantially improves citation precision over post-hoc citation generation and outperforms baselines on the released LongCite benchmark.

Methodology note

arXiv preprint 2409.02897 (September 2024). Direct fetch returned the abstract page. The paper releases a training pipeline and benchmark dataset; empirical comparison against post-hoc citation baselines is reported in the PDF.

ACL 2025 Findings

Tier A: Strongest evidence

ReliableEval: A Recipe for Stochastic LLM Evaluation via Method of Moments

arXiv (cs.CL) / EMNLP 2025 · Gili Lior et al. · 2025

Key finding

ReliableEval proposes a method-of-moments recipe for stochastic LLM evaluation that explicitly accounts for run-to-run variance in model outputs. Across standard benchmarks, the method produces tighter confidence intervals than naive averaging and reveals that some headline LLM performance comparisons are within noise margins. Released as an open evaluation toolkit.

Methodology note

arXiv preprint 2505.22169 (May 2025). Direct fetch returned the abstract page. The paper derives the method-of-moments estimator, tests it against several common evaluation tasks, and releases the toolkit for community use.

arXiv

Tier A: Strongest evidence

GEO: Generative Engine Optimization

Princeton University / Georgia Tech / Allen Institute for AI / IIT Delhi · Pranjal Aggarwal et al. · 2024

Key finding

Adding citations, quotations, and statistics to content can increase its visibility in AI-generated answers by up to 41% on average. Pages ranked outside the top of traditional search saw the largest gains. The effect varies by content domain and by AI engine, but the lift from evidence-style content elements is consistent across the conditions tested.

Methodology note

10,000 questions were run through generative search engines. The researchers compared answers before and after applying nine content optimisation strategies, including citations, quotations, statistics, and authoritative language. They measured visibility as the share of the AI answer attributable to the optimised page, using both word position and word count metrics. Peer-reviewed at KDD 2024.

arXiv / KDD 2024
Key finding

Across 680 million tracked citations between August 2024 and June 2025, the three big AI engines source very differently. Wikipedia is ChatGPT's top source at 7.8% of citations and 47.9% of its top-10 sources. Reddit leads on Perplexity (6.6% of all citations, 46.7% of top-10) and Google AI Overviews (2.2%). Around 80% of cited URLs sit on .com domains.

Methodology note

Profound analysed citations collected by its monitoring platform across ChatGPT, Google AI Overviews and Perplexity from August 2024 to June 2025. The post reports two cuts of the same data: share of total citations (per platform) and share of each platform's top 10 most-cited sources. Top-level domain distribution is broken out separately. Source lists are published in the article.

Profound

About the author: Max Ackermann

Max Ackermann is founder and Managing Director of info.link, the product data platform that makes brands visible in AI search and connects every physical product to the web through GS1 Digital Link. He writes about AI search and generative engine optimization (GEO), AI-powered commerce, and how brands can structure product data for ChatGPT, Gemini, Perplexity, and retailer AI assistants like Amazon Rufus. For the past two years he has built the pipelines that put structured product data into AI answers, and run the experiments that test what actually moves AI citations.

Max has 20+ years of experience building digital products and businesses. He previously led McKinsey's Corporate Venture and Design teams across Europe, and as Managing Director of a leading US digital agency he built platforms with Nike, Google, Meta, and Airbnb. He founded the UX Design program at Central Saint Martins College, University of the Arts London, and is a Fellow of the UK's Higher Education Academy. Based in Hamburg, he works closely with GS1 on Digital Link adoption; info.link is headquartered in Hamburg and Berlin and counts GS1 Germany among its investors.

