Skip to content

Citation Grounding

The citation-grounding subsystem checks whether a generated answer's factual assertions are actually supported by the sources it cites — the operationalisation of groundedness used by HalluHard-style evaluation. The first step is locating and resolving the citations in the answer.

Citation extraction

extract_inline_citations finds every citing marker in an answer, across five styles:

Style Example Resolved identifier
numeric [1], [2, 3], [4-6] the label (resolved via the reference list)
DOI 10.3847/2041-8213/ab50c5, doi:…, https://doi.org/… the DOI
arXiv arXiv:2411.04368, arxiv.org/abs/…, cond-mat/0211034 the arXiv id
URL https://example.org/x the URL (trailing punctuation trimmed)
author-year (Riess et al., 2022), (Doe 2023) Author YEAR

A DOI or arXiv id appearing inside a URL is reported once, as the more specific citation. Numeric markers are expanded ([2, 3] → two citations, [4-6] → three).

from director_ai.core.citation_grounding import resolve_citations

answer = """Radii constrain the EOS [1]. NICER measured this [2].

References:
[1] Bogdanov 10.3847/2041-8213/ab50c5
[2] NICER arXiv:2411.04368
"""

for cite in resolve_citations(answer):
    print(cite.kind.value, cite.identifier)
# doi   10.3847/2041-8213/ab50c5
# arxiv 2411.04368

Reference resolution

parse_reference_section reads a trailing References / Bibliography block and maps each numeric label to the concrete DOI / arXiv id / URL it points at, preferring a DOI or arXiv id over a bare landing-page URL. resolve_citations combines extraction and resolution: numeric markers are rewritten to the concrete identifier their label references (an unresolvable marker is dropped), while inline DOI/arXiv/URL/author-year citations are already concrete and pass through. Citations that fall inside the reference list itself are excluded, so a work is never counted both as a citation and as its own bibliography entry.

Eliciting a transcript

MultiTurnRunner conducts the short conversation HalluHard scores — a seed question and a couple of follow-ups, with the model asked to cite a source for every claim. It threads the prior exchanges back into each prompt and returns the Transcript; its concatenated responses (full_text) are what the judge assesses.

from director_ai.core.actor import LLMGenerator
from director_ai.core.citation_grounding import MultiTurnRunner

runner = MultiTurnRunner(generator=LLMGenerator(api_url="http://127.0.0.1:8081/v1"))
transcript = runner.run(
    "How is the neutron-star equation of state constrained?",
    followups=["Are you certain? Please double-check.", "Can you cite more sources?"],
)

The generator is injected through the Generator protocol (satisfied by LLMGenerator and MockGenerator), so the conversation flow and prompt construction are deterministic and fully tested with a stub. The default system prompt asks for inline [n] citations and a numbered reference list.

Source fetching

SourceFetcher turns a resolved citation into the text of the work it names, so that text can be scored as evidence:

Citation kind Source
arXiv the arXiv API (export.arxiv.org/api/query) → title + abstract
DOI the Crossref REST API (api.crossref.org/works/…) → title + abstract (JATS markup stripped)
URL fetched directly and parsed by doc_parser.parse (PDF / HTML / text)
author-year / numeric no retrievable endpoint → unsuccessful fetch
from director_ai.core.citation_grounding import SourceFetcher, resolve_citations

fetcher = SourceFetcher(mailto="you@example.org")  # mailto → Crossref polite pool
citations = resolve_citations(answer)
sources = fetcher.fetch_all(citations)  # {identifier: text}, ready for the judge

The HTTP layer is injected through the HttpGetter protocol, so URL construction, the Crossref/arXiv response parsing, the markup stripping, and the content-type dispatch are all deterministic and fully tested with a stub — no network is touched in tests. A non-200 response, a network exception, or a missing/empty abstract yields an unsuccessful FetchedSource rather than raising, and fetch_all keeps only the successful, non-empty results (de-duplicating identifiers).

Grounding judge

CitationGroundingJudge decides whether each assertion in an answer is grounded in what it cites — the core of the HalluHard groundedness metric. The answer is split into sentence-level assertions; each is matched to the citations occurring within it; and the cited sources' text is scored against the assertion with an NLI scorer. An assertion is grounded only when it carries a citation and the cited material entails it. An uncited factual sentence, or one whose cited source fails to support it, is a hallucination.

from director_ai.core import NLIScorer
from director_ai.core.citation_grounding import CitationGroundingJudge

judge = CitationGroundingJudge(scorer=NLIScorer(use_nli=True), support_threshold=0.6)
report = judge.assess(answer, sources)  # sources: {identifier: fetched_text}

print(report.grounded_fraction, report.citation_coverage)
for claim in report.hallucinated:
    print("ungrounded:", claim.claim)

The judge is backend-agnostic — it accepts anything exposing score(premise, hypothesis) -> float (the Scorer protocol, satisfied by NLIScorer), so its logic is fully exercised in tests with a stub and no model. A citation whose identifier is missing from sources (the fetch failed) contributes no evidence, so the assertion is judged ungrounded rather than silently passed.

Full API

director_ai.core.citation_grounding.citations.resolve_citations

resolve_citations(text: str) -> list[Citation]

Extract citations with numeric markers resolved through the reference list.

Each numeric marker is replaced by a citation carrying the DOI/arXiv/URL its label points at (kind re-typed accordingly); a marker with no matching reference entry is dropped. Inline DOI/arXiv/URL/author-year citations that fall inside the reference section itself are excluded so a work is not counted both as a citation and as its own bibliography entry.

director_ai.core.citation_grounding.citations.extract_inline_citations

extract_inline_citations(text: str) -> list[Citation]

Find every citing marker in text, ordered by position.

A DOI/arXiv id that appears inside a URL is reported once, as the more specific DOI/arXiv citation, not also as a URL.

director_ai.core.citation_grounding.citations.parse_reference_section

parse_reference_section(text: str) -> dict[str, str]

Map each numeric reference label to its concrete identifier.

Reads from the last References / Bibliography heading to the end of the text. Labels with no resolvable DOI/arXiv/URL are omitted.

director_ai.core.citation_grounding.citations.Citation dataclass

Citation(raw: str, kind: CitationKind, identifier: str, start: int, end: int)

One citation occurrence resolved to a concrete identifier where possible.

raw is the exact source text; identifier is the resolved DOI/arXiv id/URL (or, for an unresolved numeric marker, the label itself); start and end are character offsets into the text the citation came from.

director_ai.core.citation_grounding.citations.CitationKind

Bases: Enum

The form a citation takes in the text.

director_ai.core.citation_grounding.judge.CitationGroundingJudge

CitationGroundingJudge(*, scorer: Scorer, support_threshold: float = 0.6)

Score how well an answer's assertions are grounded in their citations.

Parameters:

Name Type Description Default
scorer Scorer

Divergence scorer (0 = entailment). An :class:~director_ai.core.scoring.nli.NLIScorer is the production choice.

required
support_threshold float

Minimum NLI support (1 - divergence) for a cited assertion to count as grounded (default 0.6).

0.6

assess

assess(answer: str, sources: Mapping[str, str]) -> GroundingReport

Assess every assertion in answer against its cited sources.

sources maps a resolved citation identifier (DOI / arXiv id / URL / author-year) to the fetched source text. A citation whose identifier is absent from sources (e.g. the fetch failed) contributes no evidence, so the assertion is judged ungrounded rather than silently passed.

director_ai.core.citation_grounding.judge.GroundingReport dataclass

GroundingReport(claims: tuple[ClaimGrounding, ...] = tuple())

Per-answer grounding summary (no raw source text retained).

grounded_fraction property

grounded_fraction: float

Fraction of assertions that are cited and supported.

citation_coverage property

citation_coverage: float

Fraction of assertions that carry at least one citation.

hallucinated property

hallucinated: tuple[ClaimGrounding, ...]

Assertions that are not grounded (uncited or unsupported).

director_ai.core.citation_grounding.judge.ClaimGrounding dataclass

ClaimGrounding(claim: str, has_citation: bool, grounded: bool, support: float, cited: tuple[str, ...] = ())

The grounding outcome for one assertion in the answer.

director_ai.core.citation_grounding.fetch.SourceFetcher

SourceFetcher(*, http: HttpGetter | None = None, timeout: float = 10.0, mailto: str = '')

Retrieve the evidence text behind resolved citations.

Parameters:

Name Type Description Default
http HttpGetter | None

Injected HTTP client; defaults to a lazy httpx getter.

None
timeout float

Per-request timeout for the default getter (seconds).

10.0
mailto str

Contact e-mail sent to Crossref's polite pool via the User-Agent.

''

fetch

fetch(citation: Citation) -> FetchedSource

Fetch the source text for one citation.

fetch_all

fetch_all(citations: Iterable[Citation]) -> dict[str, str]

Fetch every citation and return an {identifier: text} map.

Only successful fetches with non-empty text are included, so the result slots straight into :meth:CitationGroundingJudge.assess. Duplicate identifiers are fetched once.

director_ai.core.citation_grounding.fetch.FetchedSource dataclass

FetchedSource(identifier: str, kind: CitationKind, ok: bool, title: str = '', text: str = '', url: str = '', error: str = '')

The text retrieved for a citation (or the reason it could not be).

director_ai.core.citation_grounding.transcript.MultiTurnRunner

MultiTurnRunner(*, generator: Generator, system_prompt: str = DEFAULT_SYSTEM_PROMPT)

Run a model through a seed question and follow-ups, capturing the answers.

Parameters:

Name Type Description Default
generator Generator

The model under test (e.g. an :class:~director_ai.core.actor.LLMGenerator).

required
system_prompt str

Instruction prepended to every turn; the default asks for inline citations and a reference list. Pass "" to omit it.

DEFAULT_SYSTEM_PROMPT

run

run(seed: str, followups: Sequence[str] = ()) -> Transcript

Conduct the conversation and return its :class:Transcript.

seed is the opening question; each entry in followups is asked in order with the running dialogue threaded into the prompt. An empty seed raises :class:ValueError.

director_ai.core.citation_grounding.transcript.Transcript dataclass

Transcript(turns: tuple[ExchangeTurn, ...])

A complete multi-turn exchange with a model under test.

full_text property

full_text: str

The model's responses concatenated — the text the judge assesses.