Definitions of key terms used across SIGI research papers on Generative Engine Optimization (GEO), large language model (LLM) citation behaviour, and AI source selection methodology.
The practice of structuring digital content so that generative AI systems are more likely to surface, cite, and accurately represent it in their responses. GEO encompasses content architecture, semantic clarity, schema markup, and technical implementation strategies. Unlike traditional SEO, which targets search engine ranking algorithms, GEO targets the retrieval and synthesis mechanisms within large language models.
A neural network trained on vast text corpora that generates human-like responses by predicting token sequences. In GEO research, LLMs are the primary systems under study, as their internal source selection mechanisms determine which content gets cited in AI-generated responses. Major commercial LLMs include GPT-4, Gemini, Claude, and the models powering Perplexity.
An architecture that supplements an LLM's parametric knowledge by retrieving external documents at inference time and injecting them into the generation context. Retrieval-augmented generation (RAG) is the primary mechanism through which AI search engines like Perplexity and Microsoft Copilot (formerly Bing Chat) provide cited, up-to-date responses. Understanding RAG pipelines is essential to GEO because content must be retrievable before it can be cited.
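The retrieve-then-inject flow can be sketched in a few lines. This is a toy illustration under stated assumptions, not how any production system works: the corpus, the example.com URLs, and the token-overlap scoring are all hypothetical stand-ins for a real index and a real relevance model.

```python
import re
from collections import Counter

# Hypothetical mini-corpus; URLs and texts are placeholders.
DOCUMENTS = {
    "https://example.com/geo": "Generative Engine Optimization structures content for AI citation.",
    "https://example.com/rag": "Retrieval-augmented generation retrieves documents at inference time.",
    "https://example.com/seo": "Traditional SEO targets search engine ranking algorithms.",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, documents, k=1):
    """Return up to k URLs ranked by token overlap with the query (toy scoring)."""
    query_tokens = Counter(tokenize(query))
    scored = []
    for url, text in documents.items():
        overlap = sum((query_tokens & Counter(tokenize(text))).values())
        scored.append((overlap, url))
    # Highest overlap first; ties broken alphabetically by URL.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [url for overlap, url in scored[:k] if overlap > 0]


def build_prompt(query, documents, k=1):
    """Inject the retrieved passages into the generation context."""
    sources = retrieve(query, documents, k)
    context = "\n".join(
        f"[{i + 1}] {documents[url]} (source: {url})" for i, url in enumerate(sources)
    )
    return (
        "Answer using only the sources below and cite them by number.\n"
        f"{context}\n\nQuestion: {query}"
    )
```

A document that scores zero overlap is never placed in the prompt, which is the sense in which content must be retrievable before it can be cited.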
The act of a generative AI system attributing information in its response to a specific external source. AI citations differ from academic citations in that they are generated algorithmically based on retrieval confidence, source authority signals, and contextual relevance. Citation behaviour varies significantly across LLM platforms and query types.
Any attribute of a content source that influences an LLM's likelihood of selecting it for citation. Trust signals include domain authority, content freshness, schema markup, cross-domain corroboration, author credentials, and publication provenance. SIGI research investigates which trust signals carry the most weight across different LLM platforms.
A precisely designed query submitted to an LLM to test a specific hypothesis about its citation or response behaviour. Controlled probes are a core element of SIGI's research methodology, allowing researchers to isolate individual variables while holding other factors constant. Each probe is documented with its exact wording, target model, and expected versus actual outcomes.
The point at which the overall sentiment polarity of content about an entity begins to influence LLM citation behaviour. SIGI research has observed that LLMs exhibit sensitivity to aggregate sentiment, sometimes suppressing or amplifying sources based on the emotional tone of surrounding content. The exact threshold varies by model and context.
A ranked classification of evidence types used in GEO research, from controlled experiments at the top through observational studies and case analyses to anecdotal reports at the bottom. SIGI employs an evidence hierarchy to ensure that research claims are proportionate to the strength of supporting evidence. This framework helps distinguish between correlation and causation in citation behaviour studies.
A conditional decision point in SIGI's research framework that determines whether an observed correlation merits further causal investigation. Logic gates apply structured criteria such as consistency across models, temporal stability, and effect size before a pattern advances from observation to hypothesis. This prevents premature causal claims in GEO research.
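One way to picture such a gate is as a function that checks each criterion and only passes a pattern when all of them hold. The sketch below is illustrative only: the `Observation` fields and the three threshold values are hypothetical and are not taken from SIGI's framework.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    models_showing_effect: int  # models in which the pattern appeared
    models_tested: int          # models probed in total
    weeks_stable: int           # how long the pattern has persisted
    effect_size: float          # standardised effect size


# Illustrative thresholds only; these are assumptions, not SIGI values.
MIN_MODEL_SHARE = 0.75
MIN_WEEKS_STABLE = 4
MIN_EFFECT_SIZE = 0.2


def gate(obs):
    """Return (passes, per-criterion results) for an observed correlation."""
    checks = {
        "consistency_across_models": (
            obs.models_tested > 0
            and obs.models_showing_effect / obs.models_tested >= MIN_MODEL_SHARE
        ),
        "temporal_stability": obs.weeks_stable >= MIN_WEEKS_STABLE,
        "effect_size": abs(obs.effect_size) >= MIN_EFFECT_SIZE,
    }
    return all(checks.values()), checks
```

Returning the per-criterion results alongside the verdict keeps a record of why a pattern was held back at the observation stage.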
An unmeasured or uncontrolled variable that correlates with both the independent and dependent variables in a study, potentially producing a spurious association. In GEO research, common confounds include domain authority, content age, and topic popularity. SIGI's methodology explicitly identifies confounding variables and accounts for them through experimental design and statistical controls.
A hypothetical scenario used in causal reasoning to determine what would have happened if a specific variable had been different. In SIGI's research, counterfactual analysis helps establish whether a content change actually caused a change in citation behaviour, or whether the outcome would have occurred regardless. This is essential for moving beyond correlation in GEO studies.
An Insufficient but Necessary part of an Unnecessary but Sufficient condition for a given outcome. In GEO research, INUS analysis recognises that citation by an LLM rarely has a single cause; instead, multiple factors combine in different configurations. A factor like schema markup may be an INUS condition for citation: necessary within certain causal bundles but not sufficient on its own.
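The INUS structure can be made concrete by treating each causal bundle as a set of factors that is jointly sufficient for the outcome. The bundles and factor names below are hypothetical examples chosen for illustration, not findings, and each bundle is assumed to be minimal (no factor in it is redundant).

```python
# Hypothetical causal bundles: each set is assumed to be minimally sufficient.
BUNDLES = [
    frozenset({"schema_markup", "definitional_lead"}),
    frozenset({"cross_domain_corroboration", "fresh_content"}),
]


def outcome_occurs(present_factors, bundles):
    """The outcome occurs if every factor of at least one bundle is present."""
    return any(bundle <= set(present_factors) for bundle in bundles)


def is_inus(factor, bundles):
    """Check whether a factor is an INUS condition given minimal sufficient bundles."""
    bundles = [frozenset(b) for b in bundles]
    if frozenset({factor}) in bundles:
        return False  # sufficient on its own, so it is not "insufficient"
    # The factor must belong to some bundle (a necessary part of it, by minimality)
    # and at least one other bundle must exist (so that bundle is unnecessary).
    return any(
        factor in bundle and any(other != bundle for other in bundles)
        for bundle in bundles
    )
```

Under these assumed bundles, `schema_markup` alone does not produce the outcome, but it does when combined with `definitional_lead`, and the outcome can also arise through the second bundle without it.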
A causal classification framework used in SIGI research. A necessary condition must be present for an outcome to occur. A sufficient condition guarantees the outcome by itself. A contributory condition increases the probability of the outcome without being strictly required. Most GEO factors are contributory rather than necessary or sufficient, meaning they improve citation likelihood without guaranteeing it.
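The three categories can be distinguished, at least within a sample, by comparing outcomes with and without the factor. The sketch below assumes observations are simple boolean pairs; it only reports what the sample is consistent with and does not establish causation.

```python
def classify(observations):
    """Classify a factor from (factor_present, outcome_occurred) boolean pairs."""
    with_factor = [outcome for present, outcome in observations if present]
    without_factor = [outcome for present, outcome in observations if not present]
    if not with_factor or not without_factor:
        return "undetermined"  # both conditions are needed for a comparison

    necessary = not any(without_factor)  # outcome never occurs without the factor
    sufficient = all(with_factor)        # outcome always occurs with the factor
    if necessary and sufficient:
        return "necessary and sufficient"
    if necessary:
        return "necessary"
    if sufficient:
        return "sufficient"

    # Neither strict condition holds: compare outcome rates instead.
    rate_with = sum(with_factor) / len(with_factor)
    rate_without = sum(without_factor) / len(without_factor)
    return "contributory" if rate_with > rate_without else "no observed contribution"
```

A factor that raises the outcome rate from one in three to two in three, for example, would be labelled contributory: it helps without being required and without guaranteeing the result.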
The concentration of clearly defined, semantically distinct entities within a piece of content. Higher entity density can improve an LLM's ability to parse and retrieve specific factual claims from the content. SIGI research examines how entity density interacts with other structural factors to influence citation likelihood across different query types.
A content structure pattern where the core definition or conclusion appears in the opening sentence of a section, following the Bottom Line Up Front principle. SIGI research indicates that definitional leads correlate with higher citation rates because LLMs can extract and attribute key claims more easily when they appear at the beginning of a content block rather than buried within longer passages.
Structured data embedded in a webpage's HTML, most commonly in JSON-LD format (Microdata and RDFa are alternatives), that explicitly declares the type, properties, and relationships of content to search engines and AI systems. Schema markup based on the Schema.org vocabulary may help LLMs interpret content with higher confidence. Common GEO-relevant schema types include ScholarlyArticle, FAQPage, DefinedTerm, and Organization.
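As a minimal sketch of what JSON-LD embedding looks like, the helper below builds a Schema.org DefinedTerm object and wraps it in the script tag that carries it in a page. The function name and the example values are hypothetical; only the Schema.org type and property names are taken from the vocabulary.

```python
import json


def defined_term_jsonld(name, description, term_set_name):
    """Return an HTML script tag containing DefinedTerm markup as JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "DefinedTerm",
        "name": name,
        "description": description,
        "inDefinedTermSet": {"@type": "DefinedTermSet", "name": term_set_name},
    }
    # Escape "</" so a stray "</script>" in a value cannot close the tag early.
    body = json.dumps(data, indent=2).replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{body}\n</script>'


snippet = defined_term_jsonld(
    "Trust signal",
    "Any attribute of a content source that influences citation likelihood.",
    "Example glossary",  # placeholder set name
)
```

The resulting tag is placed in the page's HTML and stays invisible to human readers while remaining parseable by crawlers.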
A Schema.org structured data type that marks up question-and-answer content on a webpage. FAQPage schema is significant in GEO because it maps directly to how users query LLMs, and AI systems can extract Q&A pairs with high fidelity when they are explicitly declared. This schema type can make content eligible for traditional rich results, where a search engine still supports them, as well as for use in AI-generated responses.
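The explicit declaration of Q&A pairs has a simple shape: a FAQPage whose `mainEntity` is a list of Question items, each with an `acceptedAnswer`. The helper below is a hedged sketch; the function name and the sample pair are hypothetical, while the type and property names come from Schema.org.

```python
import json


def faq_page_jsonld(pairs):
    """Build a FAQPage JSON-LD object from (question, answer) string pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }


faq = faq_page_jsonld(
    [("What is a trust signal?", "An attribute of a source that influences citation.")]
)
faq_json = json.dumps(faq, indent=2)  # serialised form for embedding in a page
```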
A Schema.org structured data type used to mark up academic and research publications with metadata including authors, publication date, abstract, citation references, and institutional affiliation. ScholarlyArticle schema signals to AI systems that the content is research-grade, which may influence its weighting in citation decisions for knowledge-intensive queries.
A proposed convention for a Markdown-formatted text file placed at the root of a website that provides structured context about the site, its purpose, and its key content specifically for consumption by large language models. Similar in concept to robots.txt but designed to inform rather than restrict. The llms.txt file is intended to help LLMs understand the authority, scope, and structure of a domain when making retrieval and citation decisions; whether a given AI system actually reads it varies by platform.
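The sketch below renders a file in the general shape described by the llms.txt proposal: a title line, a short summary, and sections of annotated links. The function name, site name, and URLs are hypothetical, and the layout reflects one reading of the proposal rather than a guaranteed-conformant generator.

```python
def render_llms_txt(site_name, summary, sections):
    """Render llms.txt content from a dict of heading -> [(title, url, note)]."""
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    # Trim trailing blank lines and end with a single newline.
    return "\n".join(lines).rstrip() + "\n"


llms_txt = render_llms_txt(
    "Example Institute",  # placeholder site name
    "Research on how AI systems select and cite sources.",
    {
        "Reference": [
            ("Glossary", "https://example.com/glossary", "Definitions of key terms"),
        ],
    },
)
```

The output would be saved as `/llms.txt` at the site root.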
A file defined by the Robots Exclusion Protocol that instructs web crawlers which pages to access or avoid. In the AI context, robots.txt has gained new significance because LLM training pipelines and AI search crawlers honour rules addressed to AI-specific user agents and control tokens (such as GPTBot, ClaudeBot, and Google-Extended) when determining which content they may crawl or use. A misconfigured robots.txt can inadvertently block AI systems from accessing citable content.
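A quick way to check for accidental blocking is to evaluate a robots.txt against specific user agents with Python's standard-library parser. The rules and the example.com URL below are hypothetical; they deliberately block GPTBot while allowing everything else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: GPTBot is blocked site-wide, all other agents are allowed.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the given user agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


gptbot_ok = allowed(ROBOTS_TXT, "GPTBot", "https://example.com/glossary")
claudebot_ok = allowed(ROBOTS_TXT, "ClaudeBot", "https://example.com/glossary")
```

Here `gptbot_ok` is `False` and `claudebot_ok` is `True`, so the glossary page would be invisible to one crawler and reachable by the other.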
A protocol that allows websites to notify participating search engines instantly when content is created, updated, or deleted. In GEO contexts, IndexNow can accelerate the inclusion of new or updated content in AI search retrieval indexes. Supported by Bing and Yandex, IndexNow reduces the latency between content publication and potential AI citation.
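A notification is, at its simplest, an HTTP request carrying the changed URL and the site's verification key. The sketch below only constructs the request data and sends nothing; the key `abc123` and the example.com host are placeholders, and the endpoint and field names follow the public IndexNow documentation as understood here, so they should be verified against it before use.

```python
from urllib.parse import urlencode

# Shared endpoint from the IndexNow documentation (assumed current).
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def indexnow_ping_url(page_url, key):
    """Build the GET URL that reports a single changed page."""
    return f"{INDEXNOW_ENDPOINT}?{urlencode({'url': page_url, 'key': key})}"


def indexnow_batch_payload(host, key, urls):
    """Build the JSON body for reporting several URLs in one POST request."""
    return {
        "host": host,
        "key": key,
        # Assumes the key file is hosted at the site root as <key>.txt.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": list(urls),
    }


ping = indexnow_ping_url("https://example.com/glossary", "abc123")
payload = indexnow_batch_payload(
    "example.com", "abc123", ["https://example.com/glossary"]
)
```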
A content strategy that structures information in two parallel layers: one optimised for human readability and engagement, and another optimised for machine parsing and AI retrieval. The dual-layer approach uses visible content for human audiences while embedding structured data, schema markup, and machine-readable metadata for AI systems. SIGI research examines how this architecture affects citation rates.
The structural organisation of information within and across web pages, including heading hierarchies, internal linking patterns, topic clustering, and semantic relationships between content units. In GEO, content architecture determines how effectively AI systems can navigate, parse, and extract citable information from a website. Well-architected content is more likely to be retrieved and accurately cited.
A heading structured as a natural-language question, mirroring the way users query AI systems. Question-format headings create a direct semantic match between user prompts and content sections, which can improve retrieval relevance in RAG pipelines. SIGI research investigates whether question-format headings produce measurably higher citation rates compared to declarative or keyword-based headings.
A self-contained content block that provides a complete answer to a specific question without requiring the reader or AI system to consult surrounding context. Information islands are designed to be extractable and citable in isolation. SIGI research suggests that content structured as discrete information islands may be cited more reliably than content that depends on broader page context for comprehension.
The phenomenon where an entity's credibility in one domain or on one platform reinforces its perceived authority across other domains and platforms. In GEO, cross-domain authority manifests when consistent entity information across multiple authoritative sources increases an LLM's confidence in citing that entity. Schema properties like sameAs help establish cross-domain linkages.
A Schema.org property that links an entity to its canonical representations on other platforms and domains, such as Wikipedia, LinkedIn, Wikidata, or official websites. The sameAs property helps AI systems resolve entity identity across the web, increasing confidence that references to the same name on different platforms refer to the same entity. This disambiguation supports stronger citation authority.
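In markup, `sameAs` is simply a list of URLs attached to an entity. The sketch below builds an Organization object with that list; the organisation name and every URL are placeholders, and the helper name is hypothetical.

```python
import json


def organization_jsonld(name, url, same_as):
    """Build Organization JSON-LD whose sameAs lists the entity's other profiles."""
    return {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "sameAs": list(same_as),
    }


org = organization_jsonld(
    "Example Institute",
    "https://example.com",
    [
        # Placeholder profile URLs for illustration only.
        "https://www.wikidata.org/wiki/Q0000000",
        "https://www.linkedin.com/company/example-institute",
    ],
)
org_json = json.dumps(org, indent=2)  # serialised form for embedding in a page
```

Each listed URL should point to a page that unambiguously describes the same entity; otherwise the markup adds ambiguity instead of removing it.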
The tendency of LLMs to give disproportionate weight to information that appears early in a document, retrieved context window, or training sequence. Primacy bias has implications for GEO because content positioned at the beginning of a page or section may be more likely to be selected for citation. SIGI research tests for primacy bias as a potential confound in content architecture studies.
An observed pattern in SIGI research where both very low and very high price points correlate with increased LLM citation confidence for commercial entities, while mid-range prices produce weaker trust signals. The U-curve suggests that LLMs may associate pricing extremes with either authoritative premium positioning or well-known value brands, while mid-range pricing lacks distinctive signalling.
The length of time a domain has been registered and active on the web. SIGI research has investigated domain age as a potential trust signal for LLM citation and found it to be a null signal, meaning it does not independently predict citation likelihood when other factors are controlled for. This finding challenges the common assumption that older domains inherently carry more authority in AI systems.
These terms are used throughout our published research papers. Read our studies to see how these concepts are applied in practice.