Every finding SIGI publishes is subjected to a structured sequence of logical tests before it earns a causal or correlational label. We never claim more than the evidence supports. We classify every result by evidence level. We publish null results. This page describes the public-facing framework that governs all of our research.
Not all evidence is equal. A single observation does not carry the same weight as a controlled experiment replicated across platforms. We use a seven-level evidence hierarchy — adapted from established research methodology — to classify every finding. The level determines the language we are permitted to use when describing a result.
| Level | Evidence Type | Description | Permitted Language | Prohibited Language |
|---|---|---|---|---|
| 7 | Meta-Analysis | Systematic synthesis across multiple controlled studies | Causes, establishes, demonstrates | — |
| 6 | Controlled Experiment (Replicated) | Isolation probe replicated across platforms and time periods | Strong evidence that, reliably produces | Proves, definitively |
| 5 | Controlled Experiment (Single) | Single-variable isolation with controlled conditions | Evidence suggests, appears to cause | Proves, always, universally |
| 4 | Quasi-Experiment | Comparison with partial control over variables | Associated with, linked to | Causes, proves, demonstrates |
| 3 | Observational Study | Systematic observation without variable manipulation | Observed pattern, correlates with | Causes, leads to, produces |
| 2 | Case Study | Detailed analysis of a specific instance | In this case, this instance shows | Generally, always, causes |
| 1 | Anecdote | Single unrepeated observation or report | We observed, one instance noted | Shows, suggests, indicates, causes |
This hierarchy is not decorative. It is enforced. If a finding has only been observed once without controls, it is classified as Level 1 or 2 regardless of how compelling the result appears. The language constraint prevents premature causal claims from entering our publications.
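As an illustration only, the language constraint in the table above could be checked mechanically. The following is a minimal sketch, not a description of SIGI's actual tooling: the `PROHIBITED` mapping is transcribed from the table, while the function name `prohibited_terms_used` and the whole-word matching strategy are assumptions made for this example.

```python
import re

# Prohibited terms per evidence level, transcribed from the table above.
# Level 7 has no prohibited language.
PROHIBITED = {
    7: [],
    6: ["proves", "definitively"],
    5: ["proves", "always", "universally"],
    4: ["causes", "proves", "demonstrates"],
    3: ["causes", "leads to", "produces"],
    2: ["generally", "always", "causes"],
    1: ["shows", "suggests", "indicates", "causes"],
}


def prohibited_terms_used(text: str, level: int) -> list[str]:
    """Return the prohibited terms for `level` that appear in `text`.

    Hypothetical helper: matching is case-insensitive and on whole words or
    phrases, so "shows" does not match inside "slideshows".
    """
    lowered = text.lower()
    found = []
    for term in PROHIBITED[level]:
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.append(term)
    return found


# A Level 3 (observational) finding described with causal language is flagged.
flagged = prohibited_terms_used("Schema markup causes higher citation.", 3)
```

A check like this only catches the exact word forms listed in the table. Variants such as "caused" or "proof" would still need editorial review.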
Before any causal or correlational claim is published, it must pass through all seven logic gates. A failure at any gate means the claim is downgraded, reclassified, or rejected. The gates are applied sequentially, and each one builds on the output of the previous one.
Every claim is first reduced to its formal logical structure: "If X, then Y." This gate tests whether the argument is structurally valid before any empirical evidence is considered. Claims that conflate correlation with causation, commit affirming-the-consequent errors, or contain hidden premises are caught here. If the logical form is invalid, the claim does not proceed.
This gate asks: could a co-varying variable explain the observed result? When a property change appears to affect LLM behaviour, we systematically identify other variables that may have changed simultaneously. If confounds cannot be ruled out, the finding is reclassified from causal to correlational, and the confounding variables are documented for future isolation testing.
Not all causal relationships are the same. This gate classifies the role of the variable under study. Is it necessary (the effect never occurs without it), sufficient (it alone produces the effect), or contributory (it increases the likelihood but is neither necessary nor sufficient)? Most findings in LLM citation research fall into the contributory category. Mislabelling a contributory factor as necessary or sufficient is a classification error this gate prevents.
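The three roles can be made concrete with a small sketch. It assumes, purely for illustration, that each trial is recorded as a pair `(variable_present, effect_occurred)`; the function name `classify_role` and the label strings are hypothetical. A finite set of trials can only show that a role is consistent with the data, so the output is a working label rather than a proof.

```python
def classify_role(trials: list[tuple[bool, bool]]) -> str:
    """Label the role of a variable from (variable_present, effect_occurred) trials."""
    with_var = [effect for present, effect in trials if present]
    without_var = [effect for present, effect in trials if not present]

    # Both conditions must be observed before any role can be assigned.
    if not with_var or not without_var:
        return "undetermined"
    if not any(with_var) and not any(without_var):
        return "no observed effect"

    necessary = not any(without_var)  # effect never occurs without the variable
    sufficient = all(with_var)        # effect always occurs with the variable

    if necessary and sufficient:
        return "necessary and sufficient"
    if necessary:
        return "necessary"
    if sufficient:
        return "sufficient"

    # Neither necessary nor sufficient: compare how often the effect occurs.
    rate_with = sum(with_var) / len(with_var)
    rate_without = sum(without_var) / len(without_var)
    return "contributory" if rate_with > rate_without else "no observed role"


# Effect occurs in 2 of 3 trials with the variable and 1 of 2 without it.
role = classify_role(
    [(True, True), (True, False), (True, True), (False, True), (False, False)]
)
```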
The counterfactual gate tests the inverse: what happens when the variable is absent? If adding structured data correlates with increased citation, this gate asks whether removing structured data produces a measurable decrease. A finding that passes the counterfactual gate in both directions — presence increases, absence decreases — earns a stronger evidence classification than one that only demonstrates the positive direction.
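A two-direction check of this kind might be sketched as follows. The function name, the four citation-rate arguments, and the `min_effect` threshold are all assumptions for illustration; the source does not specify how "measurable" is defined.

```python
def counterfactual_directions(
    before_add: float,
    after_add: float,
    before_remove: float,
    after_remove: float,
    min_effect: float = 0.05,  # hypothetical threshold for a "measurable" change
) -> str:
    """Report which directions of a counterfactual test show an effect.

    Each argument is a citation rate (for example, the fraction of queries
    in which the source was cited) measured before or after a change.
    """
    presence_increases = (after_add - before_add) >= min_effect
    absence_decreases = (before_remove - after_remove) >= min_effect

    if presence_increases and absence_decreases:
        return "both directions"
    if presence_increases:
        return "positive direction only"
    if absence_decreases:
        return "negative direction only"
    return "neither direction"


# Adding the variable raised the rate; removing it lowered the rate again.
result = counterfactual_directions(0.10, 0.25, 0.25, 0.12)
```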
This gate requires the researcher to systematically generate and evaluate competing explanations for the observed result. For every proposed cause, at least three plausible alternatives must be identified and tested or ruled out. The goal is not to confirm the preferred explanation but to genuinely attempt to disconfirm it. Findings survive this gate only when alternatives have been rigorously eliminated.
A finding observed once is preliminary. This gate tests whether the result can be reproduced across different conditions: different time periods, different LLM platforms, different query phrasings, and different source content. Replication across two or more dimensions is required to move a finding above a preliminary confidence rating. Failures to replicate are documented and published as null results.
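One possible reading of "replication across two or more dimensions" is sketched below. It assumes that each run is recorded as a dictionary keyed by the four dimensions named above, and that a dimension counts as replicated when at least one successful replication differs from the original run along it. The key names and function names are hypothetical.

```python
DIMENSIONS = ("time_period", "platform", "query_phrasing", "source_content")


def replicated_dimensions(original: dict, successful_replications: list[dict]) -> set[str]:
    """Return the dimensions along which a successful replication varied."""
    varied = set()
    for run in successful_replications:
        for dim in DIMENSIONS:
            if run[dim] != original[dim]:
                varied.add(dim)
    return varied


def above_preliminary(original: dict, successful_replications: list[dict]) -> bool:
    """True when the result has replicated across two or more dimensions."""
    return len(replicated_dimensions(original, successful_replications)) >= 2


original = {
    "time_period": "period-1",
    "platform": "platform-A",
    "query_phrasing": "phrasing-1",
    "source_content": "source-1",
}
# One replication on another platform, one in a later time period.
replications = [
    {**original, "platform": "platform-B"},
    {**original, "time_period": "period-2"},
]
covered = replicated_dimensions(original, replications)
```

Failed replications are deliberately left out of the input here, because the text treats them as null results to be documented separately.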
The final gate asks: is there a plausible mechanism by which the LLM could access the signal being tested? If a proposed cause has no pathway through which the model could detect or process it, the claim is suspect regardless of the observed correlation. This gate distinguishes between variables that are genuinely in the model's signal path and those that merely co-occur with something that is.
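The sequential application of the gates described above can be sketched as a simple pipeline. This is an illustrative structure only: the `Outcome` values mirror the three failure results named earlier (downgraded, reclassified, rejected), while the class names, the `run_gates` function, the gate labels, and the choice to stop at the first failing gate are assumptions rather than SIGI's documented implementation.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    PASS = "pass"
    DOWNGRADE = "downgrade"
    RECLASSIFY = "reclassify"
    REJECT = "reject"


@dataclass
class GateReport:
    passed: list[str] = field(default_factory=list)
    failed_gate: Optional[str] = None
    outcome: Outcome = Outcome.PASS


def run_gates(
    claim: dict, gates: list[tuple[str, Callable[[dict], Outcome]]]
) -> GateReport:
    """Apply gates in order and stop at the first one that does not pass."""
    report = GateReport()
    for name, check in gates:
        outcome = check(claim)
        if outcome is not Outcome.PASS:
            report.failed_gate = name
            report.outcome = outcome
            break
        report.passed.append(name)
    return report


# Two placeholder gates; a full pipeline would list all seven in order.
gates = [
    ("logical form", lambda c: Outcome.PASS if c["valid_form"] else Outcome.REJECT),
    ("confounds", lambda c: Outcome.RECLASSIFY if c["confounds"] else Outcome.PASS),
]
report = run_gates({"valid_form": True, "confounds": ["publication date"]}, gates)
```

In this example the claim is structurally valid but has an unresolved confound, so the report records a reclassification at the second gate.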
The core experimental unit in our research is the isolation probe. The principle is straightforward: change exactly one variable, hold everything else constant, and measure the resulting change in LLM behaviour — whether that is citation frequency, sentiment, source ranking, or content selection.
This approach is necessary because LLM citation behaviour is influenced by dozens of variables simultaneously. Without strict isolation, it is impossible to attribute an observed effect to a specific cause. A page that adds schema markup and rewrites its headings and updates its publication date has changed three variables. If citation frequency increases, which variable produced the effect? Isolation probes answer that question by testing each variable independently.
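The "exactly one variable" rule lends itself to a simple pre-flight check. The sketch below assumes each page variant is described by a dictionary of variables; the key names follow the example in the paragraph above, and the function names are hypothetical.

```python
_MISSING = object()  # sentinel so an absent key differs from a key set to None


def changed_variables(control: dict, treatment: dict) -> list[str]:
    """List the variables whose values differ between two page variants."""
    keys = sorted(set(control) | set(treatment))
    return [
        key
        for key in keys
        if control.get(key, _MISSING) != treatment.get(key, _MISSING)
    ]


def is_valid_isolation_probe(control: dict, treatment: dict) -> bool:
    """A probe is valid only when exactly one variable differs."""
    return len(changed_variables(control, treatment)) == 1


control = {"schema_markup": False, "headings": "v1", "publication_date": "d1"}

# Only schema markup changes: a valid isolation probe.
isolated = {**control, "schema_markup": True}

# All three variables change: the effect cannot be attributed to any one of them.
confounded = {"schema_markup": True, "headings": "v2", "publication_date": "d2"}

confounded_changes = changed_variables(control, confounded)
```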
Our probes measure changes in both sentiment (how an LLM characterises a source) and citation response (whether and how a source is referenced). The combination of these two dimensions provides a more complete picture than either metric alone.
A finding that holds on one LLM platform but fails on others is platform-specific, not general. Our research programme tests across multiple major LLM platforms to distinguish universal citation behaviours from platform-specific artefacts. The specific platforms tested are documented in each publication rather than fixed in our methodology, because the commercially significant platforms change over time.
Cross-platform testing serves as a natural replication mechanism. When a result replicates across architecturally different models from different providers, the finding earns a higher confidence rating than one observed on a single platform.
Every finding published by SIGI carries an explicit confidence rating. The rating is determined by the finding's replication status, the number of logic gates it has passed, and the evidence level it has achieved. There are four tiers.
Observed in initial testing but not yet replicated. May be based on a single platform, a single time period, or a small sample. Published to document the observation but not suitable for decision-making. Subject to reclassification or retraction as further testing is conducted.
Replicated at least once across a different condition (different platform, time period, or query set). Has passed the core logic gates but may have unresolved confounds or limited counterfactual testing. Directionally reliable but specific magnitudes may shift.
Replicated across multiple platforms and time periods. Has passed all seven logic gates. Confounds have been identified and controlled for. Suitable for informing strategy with the understanding that edge cases or platform-specific variations may exist.
Extensively replicated. Survived counterfactual testing in both directions. Mechanism identified and validated. Part of a broader evidence pattern supported by multiple independent studies. The strongest classification we assign — and the rarest.
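The four tiers described above can be read as a cumulative checklist. The sketch below numbers the tiers 1 to 4 in the order they are described. The `Finding` fields and the `confidence_tier` function are hypothetical, and judgements such as "extensively replicated" are left as flags set by the researcher rather than computed from a threshold. The requirement that the second tier has passed the "core" logic gates is omitted, because the text does not say which gates count as core.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    replicated_conditions: int = 0  # replications under a different condition
    replicated_across_platforms_and_periods: bool = False
    gates_passed: int = 0           # out of the seven logic gates
    confounds_controlled: bool = False
    extensively_replicated: bool = False
    counterfactual_both_directions: bool = False
    mechanism_validated: bool = False
    independent_studies: int = 1


def confidence_tier(f: Finding) -> int:
    """Return the highest tier (1 to 4) whose stated criteria the finding meets."""
    meets_tier_3 = (
        f.replicated_across_platforms_and_periods
        and f.gates_passed == 7
        and f.confounds_controlled
    )
    meets_tier_4 = (
        meets_tier_3
        and f.extensively_replicated
        and f.counterfactual_both_directions
        and f.mechanism_validated
        and f.independent_studies >= 2
    )
    if meets_tier_4:
        return 4
    if meets_tier_3:
        return 3
    if f.replicated_conditions >= 1:
        return 2
    return 1


# Replicated once on another platform, but not all seven gates passed yet.
tier = confidence_tier(Finding(replicated_conditions=1, gates_passed=5))
```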
Most claims about LLM behaviour in the market today are anecdotal. Someone changes a website, observes a different result in ChatGPT, and publishes a conclusion. That is Level 1 evidence — an unrepeated anecdote. It may be true, but it has not been tested.
Our methodology exists to close that gap. Three principles distinguish our approach:
We never claim more than the evidence supports. The evidence hierarchy and language constraints mean a finding cannot be described as causal until it has earned that classification through controlled, replicated testing. This is not a stylistic choice — it is a structural rule enforced at every stage of the publication process.
We publish null results. When a variable we expected to matter shows no measurable effect, we publish that finding. The absence of an effect is as informative as its presence. Omitting null results distorts the evidence base and creates survivorship bias in the literature.
We classify every finding by evidence level. Readers of our publications always know whether they are looking at a preliminary single-platform observation or a replicated, mechanism-validated finding. The confidence rating is not buried in footnotes — it is a primary feature of every published result.
We publish raw datasets so others can verify our analysis. We document our methodology in sufficient detail for independent replication. We report null results alongside positive findings. We classify every result by evidence level and confidence rating so readers can assess the strength of the evidence for themselves.
Raw datasets and experimental records are available in our Research section. Analysed findings with full methodology documentation are published in Publications.
SIGI's research programme is designed to systematically map how large language models select, cite, and surface information. The programme spans 100 planned research papers across multiple categories including citation mechanics, source authority signals, content structure effects, temporal dynamics, and cross-platform behavioural differences.
Each paper follows the Logic-First Research Methodology described on this page. Findings are published on a rolling basis as studies are completed, with each publication carrying its evidence level classification and confidence rating. The full catalogue of published and in-progress research is available in our Publications section.