Citation Grounding¶

The citation-grounding subsystem checks whether a generated answer's factual assertions are actually supported by the sources it cites — the operationalisation of groundedness used by HalluHard-style evaluation. The first step is locating and resolving the citations in the answer.

Citation extraction¶

extract_inline_citations finds every citing marker in an answer, across five styles:

Style	Example	Resolved identifier
numeric	`[1]`, `[2, 3]`, `[4-6]`	the label (resolved via the reference list)
DOI	`10.3847/2041-8213/ab50c5`, `doi:…`, `https://doi.org/…`	the DOI
arXiv	`arXiv:2411.04368`, `arxiv.org/abs/…`, `cond-mat/0211034`	the arXiv id
URL	`https://example.org/x`	the URL (trailing punctuation trimmed)
author-year	`(Riess et al., 2022)`, `(Doe 2023)`	`Author YEAR`

A DOI or arXiv id appearing inside a URL is reported once, as the more specific citation. Numeric markers are expanded ([2, 3] → two citations, [4-6] → three).
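The expansion of numeric markers can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name `expand_numeric_marker` and its regular expression are assumptions made for the example.

```python
import re


def expand_numeric_marker(marker: str) -> list[str]:
    """Expand a bracketed numeric marker into its individual labels.

    Hypothetical helper for illustration only; not part of director_ai.
    """
    inner = marker.strip()[1:-1]  # drop the surrounding brackets
    labels: list[str] = []
    for part in inner.split(","):
        part = part.strip()
        # A range such as "4-6" (hyphen or en dash) covers every label in between.
        match = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", part)
        if match:
            low, high = int(match.group(1)), int(match.group(2))
            labels.extend(str(n) for n in range(low, high + 1))
        elif part.isdigit():
            labels.append(part)
    return labels


print(expand_numeric_marker("[2, 3]"))  # ['2', '3']
print(expand_numeric_marker("[4-6]"))   # ['4', '5', '6']
```

Each expanded label then becomes its own citation, which is what lets a single bracket contribute several pieces of evidence.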

from director_ai.core.citation_grounding import resolve_citations

answer = """Radii constrain the EOS [1]. NICER measured this [2].

References:
[1] Bogdanov 10.3847/2041-8213/ab50c5
[2] NICER arXiv:2411.04368
"""

for cite in resolve_citations(answer):
    print(cite.kind.value, cite.identifier)
# doi   10.3847/2041-8213/ab50c5
# arxiv 2411.04368

Reference resolution¶

parse_reference_section reads a trailing References / Bibliography block and maps each numeric label to the concrete DOI / arXiv id / URL it points at, preferring a DOI or arXiv id over a bare landing-page URL. resolve_citations combines extraction and resolution: numeric markers are rewritten to the concrete identifier their label references (an unresolvable marker is dropped), while inline DOI/arXiv/URL/author-year citations are already concrete and pass through. Citations that fall inside the reference list itself are excluded, so a work is never counted both as a citation and as its own bibliography entry.
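The label-to-identifier mapping described above might look roughly like the sketch below. It is a simplified stand-in written for this page, not the code behind parse_reference_section: the function name `sketch_parse_references` and all regular expressions are assumptions, and real reference lists need more tolerant patterns.

```python
import re

# Illustrative patterns only; the library's actual patterns are not shown in this page.
HEADING = re.compile(r"^\s*(References|Bibliography)\s*:?\s*$", re.IGNORECASE | re.MULTILINE)
ENTRY = re.compile(r"^\s*\[(\d+)\]\s*(.+)$", re.MULTILINE)
DOI = re.compile(r"\b10\.\d{4,9}/\S+")
ARXIV = re.compile(r"arXiv:(\d{4}\.\d{4,5})", re.IGNORECASE)
URL = re.compile(r"https?://\S+")


def sketch_parse_references(text: str) -> dict[str, str]:
    """Map numeric labels to identifiers, reading from the last heading onward."""
    headings = list(HEADING.finditer(text))
    if not headings:
        return {}
    section = text[headings[-1].end():]
    resolved: dict[str, str] = {}
    for label, body in ENTRY.findall(section):
        # Order encodes the preference: DOI, then arXiv id, then a bare URL.
        for pattern in (DOI, ARXIV, URL):
            match = pattern.search(body)
            if match:
                identifier = match.group(match.lastindex or 0)
                resolved[label] = identifier.rstrip(".,;)")
                break
        # Entries with no DOI, arXiv id, or URL are simply left out.
    return resolved


answer = """Radii constrain the EOS [1]. NICER measured this [2].

References:
[1] Bogdanov 10.3847/2041-8213/ab50c5
[2] NICER arXiv:2411.04368
"""
print(sketch_parse_references(answer))
# {'1': '10.3847/2041-8213/ab50c5', '2': '2411.04368'}
```

Trying the patterns in a fixed order is one simple way to express the "DOI or arXiv id before landing-page URL" preference.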

Eliciting a transcript¶

MultiTurnRunner conducts the short conversation HalluHard scores — a seed question and a couple of follow-ups, with the model asked to cite a source for every claim. It threads the prior exchanges back into each prompt and returns the Transcript; its concatenated responses (full_text) are what the judge assesses.

from director_ai.core.actor import LLMGenerator
from director_ai.core.citation_grounding import MultiTurnRunner

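# The generator wraps the model under test; here it points at a locally served endpoint.
runner = MultiTurnRunner(generator=LLMGenerator(api_url="http://127.0.0.1:8081/v1"))
transcript = runner.run(
    "How is the neutron-star equation of state constrained?",
    followups=["Are you certain? Please double-check.", "Can you cite more sources?"],
)
print(transcript.full_text)  # the concatenated responses the judge assesses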

The generator is injected through the Generator protocol (satisfied by LLMGenerator and MockGenerator), so the conversation flow and prompt construction are deterministic and fully tested with a stub. The default system prompt asks for inline [n] citations and a numbered reference list.
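To make the threading of prior exchanges concrete, here is a self-contained sketch driven by a stub. Everything in it is hypothetical: `sketch_run`, the plain `generate(prompt) -> str` callable, and the "User:/Assistant:" prompt layout are assumptions for illustration, since the Generator protocol's exact signature and the runner's prompt format are not given in this page.

```python
from typing import Callable, Sequence


def sketch_run(
    generate: Callable[[str], str],
    seed: str,
    followups: Sequence[str] = (),
    system_prompt: str = "",
) -> list[tuple[str, str]]:
    """Ask the seed and each follow-up, feeding earlier turns back into the prompt."""
    if not seed.strip():  # treating whitespace-only as empty is an assumption
        raise ValueError("seed question must not be empty")
    turns: list[tuple[str, str]] = []
    for question in (seed, *followups):
        parts = [system_prompt] if system_prompt else []
        # Replay the running dialogue so the model sees its own earlier answers.
        for asked, answered in turns:
            parts.append(f"User: {asked}\nAssistant: {answered}")
        parts.append(f"User: {question}\nAssistant:")
        turns.append((question, generate("\n\n".join(parts))))
    return turns


def echo_stub(prompt: str) -> str:
    """Deterministic stand-in for a model: reports how many user turns it saw."""
    return f"answer after {prompt.count('User:')} user turn(s) [1]"


turns = sketch_run(echo_stub, "Q1?", followups=["Q2?"])
full_text = "\n\n".join(answer for _, answer in turns)
print(full_text)
```

Because the stub is a pure function of the prompt, a test can assert on exactly what was threaded into each turn without contacting a model.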

Source fetching¶

SourceFetcher turns a resolved citation into the text of the work it names, so that text can be scored as evidence:

Citation kind	Source
arXiv	the arXiv API (`export.arxiv.org/api/query`) → title + abstract
DOI	the Crossref REST API (`api.crossref.org/works/…`) → title + abstract (JATS markup stripped)
URL	fetched directly and parsed by `doc_parser.parse` (PDF / HTML / text)
author-year / numeric	no retrievable endpoint → unsuccessful fetch

from director_ai.core.citation_grounding import SourceFetcher, resolve_citations

fetcher = SourceFetcher(mailto="you@example.org")  # mailto → Crossref polite pool
citations = resolve_citations(answer)  # `answer` is the text from the extraction example above
sources = fetcher.fetch_all(citations)  # {identifier: text}, ready for the judge

The HTTP layer is injected through the HttpGetter protocol, so URL construction, Crossref/arXiv response parsing, markup stripping, and content-type dispatch are all deterministic and fully tested with a stub; no network is touched in tests. A non-200 response, a network exception, or a missing/empty abstract yields an unsuccessful FetchedSource rather than raising. fetch_all keeps only the successful, non-empty results and de-duplicates identifiers.
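As an illustration of the Crossref step, the sketch below turns a works response into title plus abstract with the JATS tags removed. It assumes the usual Crossref response shape (a `message` object with a `title` list and an optional `abstract` string); the function name `sketch_crossref_text` and the sample payload are made up for this example and are not the library's code.

```python
import html
import json
import re


def sketch_crossref_text(payload: str) -> str:
    """Return 'title + abstract' from a Crossref works response, or '' without an abstract."""
    message = json.loads(payload).get("message", {})
    abstract = message.get("abstract") or ""
    # Drop JATS/XML tags such as <jats:p>, then decode entities and tidy whitespace.
    abstract = html.unescape(re.sub(r"<[^>]+>", " ", abstract))
    abstract = " ".join(abstract.split())
    if not abstract:
        return ""  # a missing or empty abstract counts as an unsuccessful fetch
    titles = message.get("title") or []
    title = titles[0] if titles else ""
    return f"{title}\n\n{abstract}".strip()


# Hand-written sample payload, not a real Crossref record.
sample = json.dumps({
    "message": {
        "title": ["A sample title"],
        "abstract": "<jats:title>Abstract</jats:title>"
                    "<jats:p>Radii constrain the <jats:italic>EOS</jats:italic> &amp; more.</jats:p>",
    }
})
print(sketch_crossref_text(sample))
```

Returning an empty string instead of raising keeps the "unsuccessful fetch" case a normal value that the caller can filter out.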

Grounding judge¶

CitationGroundingJudge decides whether each assertion in an answer is grounded in what it cites, which is the core of the HalluHard groundedness metric. The answer is split into sentence-level assertions, each assertion is matched to the citations occurring within it, and the cited sources' text is scored against the assertion with an NLI scorer. An assertion is grounded only when it carries a citation and the cited material entails it, that is, when the NLI support (1 - divergence) reaches support_threshold (0.6 by default). An uncited factual sentence, or one whose cited source fails to support it, is a hallucination.

from director_ai.core import NLIScorer
from director_ai.core.citation_grounding import CitationGroundingJudge

judge = CitationGroundingJudge(scorer=NLIScorer(use_nli=True), support_threshold=0.6)
report = judge.assess(answer, sources)  # sources: {identifier: fetched_text}

print(report.grounded_fraction, report.citation_coverage)
for claim in report.hallucinated:
    print("ungrounded:", claim.claim)

The judge is backend-agnostic: it accepts anything exposing score(premise, hypothesis) -> float (the Scorer protocol, satisfied by NLIScorer), where the returned value is a divergence (0 = entailment) and support is taken as 1 - divergence. Its logic is therefore fully exercised in tests with a stub and no model. A citation whose identifier is missing from sources (the fetch failed) contributes no evidence, so the assertion is judged ungrounded rather than silently passed.
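The decision rule can be sketched with a stub scorer, as below. This is a simplified model of the behaviour described in this section, not the judge's implementation: `sketch_assess`, `SketchClaim`, and `stub_divergence` are hypothetical names, sentence splitting and citation matching are skipped (claims are passed in pre-split), and taking the best support across several cited sources is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence


@dataclass(frozen=True)
class SketchClaim:
    claim: str
    cited: tuple[str, ...]
    grounded: bool
    support: float


def sketch_assess(
    claims: Sequence[tuple[str, Sequence[str]]],
    sources: Mapping[str, str],
    score: Callable[[str, str], float],
    support_threshold: float = 0.6,
) -> list[SketchClaim]:
    """Judge pre-split (sentence, cited identifiers) pairs against fetched sources."""
    results = []
    for sentence, cited in claims:
        # Only citations whose text was actually fetched count as evidence.
        evidence = [sources[i] for i in cited if i in sources]
        # score() returns a divergence (0 = entailment), so support is 1 - divergence.
        support = max((1.0 - score(text, sentence) for text in evidence), default=0.0)
        grounded = bool(cited) and support >= support_threshold
        results.append(SketchClaim(sentence, tuple(cited), grounded, support))
    return results


def stub_divergence(premise: str, hypothesis: str) -> float:
    """Toy scorer: low divergence when both texts mention 'radii'."""
    return 0.1 if "radii" in premise.lower() and "radii" in hypothesis.lower() else 0.9


results = sketch_assess(
    claims=[
        ("Radii constrain the EOS.", ["10.3847/2041-8213/ab50c5"]),
        ("NICER measured this.", ["2411.04368"]),  # cited, but the fetch failed
        ("The sky is green.", []),                 # uncited
    ],
    sources={"10.3847/2041-8213/ab50c5": "Neutron-star radii constrain the equation of state."},
    score=stub_divergence,
)
grounded_fraction = sum(r.grounded for r in results) / len(results)
citation_coverage = sum(bool(r.cited) for r in results) / len(results)
print([r.grounded for r in results], grounded_fraction, citation_coverage)
```

In this toy run only the first claim is grounded: the second is cited but has no fetched evidence, and the third carries no citation at all.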

Full API¶

director_ai.core.citation_grounding.citations.resolve_citations ¶

resolve_citations(text: str) -> list[Citation]

Extract citations with numeric markers resolved through the reference list.

Each numeric marker is replaced by a citation carrying the DOI/arXiv/URL its label points at (kind re-typed accordingly); a marker with no matching reference entry is dropped. Inline DOI/arXiv/URL/author-year citations that fall inside the reference section itself are excluded so a work is not counted both as a citation and as its own bibliography entry.

director_ai.core.citation_grounding.citations.extract_inline_citations ¶

extract_inline_citations(text: str) -> list[Citation]

Find every citing marker in text, ordered by position.

A DOI/arXiv id that appears inside a URL is reported once, as the more specific DOI/arXiv citation, not also as a URL.
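One way to get the "reported once" behaviour is to collect the DOI spans first and skip any URL match that encloses one. The sketch below shows that idea for DOIs and URLs only. It is not the library's code: `sketch_extract`, the patterns, and the trailing-punctuation set are assumptions chosen for the example.

```python
import re

# Illustrative patterns only; real DOI and URL matching is more involved.
DOI = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")
URL = re.compile(r"https?://[^\s\"<>]+")
TRAILING = ".,;:!?)]}'\""


def sketch_extract(text: str) -> list[tuple[str, str]]:
    """Return (kind, identifier) pairs ordered by position, DOI taking precedence over URL."""
    found: list[tuple[int, str, str]] = []
    doi_spans: list[tuple[int, int]] = []
    for match in DOI.finditer(text):
        raw = match.group().rstrip(TRAILING)  # trim sentence punctuation
        doi_spans.append((match.start(), match.start() + len(raw)))
        found.append((match.start(), "doi", raw))
    for match in URL.finditer(text):
        raw = match.group().rstrip(TRAILING)
        start, end = match.start(), match.start() + len(raw)
        # Skip a URL that merely wraps a DOI already reported above.
        if any(start <= s and e <= end for s, e in doi_spans):
            continue
        found.append((start, "url", raw))
    return [(kind, identifier) for _, kind, identifier in sorted(found)]


text = "See https://doi.org/10.3847/2041-8213/ab50c5, and https://example.org/x."
print(sketch_extract(text))
# [('doi', '10.3847/2041-8213/ab50c5'), ('url', 'https://example.org/x')]
```

The same containment check would extend to arXiv ids embedded in arxiv.org URLs.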

director_ai.core.citation_grounding.citations.parse_reference_section ¶

parse_reference_section(text: str) -> dict[str, str]

Map each numeric reference label to its concrete identifier.

Reads from the last References / Bibliography heading to the end of the text. Labels with no resolvable DOI/arXiv/URL are omitted.

director_ai.core.citation_grounding.citations.Citation `dataclass` ¶

Citation(raw: str, kind: CitationKind, identifier: str, start: int, end: int)

One citation occurrence resolved to a concrete identifier where possible.

raw is the exact source text; identifier is the resolved DOI/arXiv id/URL (or, for an unresolved numeric marker, the label itself); start and end are character offsets into the text the citation came from.

director_ai.core.citation_grounding.citations.CitationKind ¶

Bases: Enum

The form a citation takes in the text.

director_ai.core.citation_grounding.judge.CitationGroundingJudge ¶

CitationGroundingJudge(*, scorer: Scorer, support_threshold: float = 0.6)

Score how well an answer's assertions are grounded in their citations.

Parameters:

Name	Type	Description	Default
`scorer`	`Scorer`	Divergence scorer (`0` = entailment). An `NLIScorer` (`director_ai.core.scoring.nli.NLIScorer`) is the production choice.	required
`support_threshold`	`float`	Minimum NLI support (`1 - divergence`) for a cited assertion to count as grounded.	`0.6`

assess ¶

assess(answer: str, sources: Mapping[str, str]) -> GroundingReport

Assess every assertion in answer against its cited sources.

sources maps a resolved citation identifier (DOI / arXiv id / URL / author-year) to the fetched source text. A citation whose identifier is absent from sources (e.g. the fetch failed) contributes no evidence, so the assertion is judged ungrounded rather than silently passed.

director_ai.core.citation_grounding.judge.GroundingReport `dataclass` ¶

GroundingReport(claims: tuple[ClaimGrounding, ...] = tuple())

Per-answer grounding summary (no raw source text retained).

grounded_fraction `property` ¶

grounded_fraction: float

Fraction of assertions that are cited and supported.

citation_coverage `property` ¶

citation_coverage: float

Fraction of assertions that carry at least one citation.

hallucinated `property` ¶

hallucinated: tuple[ClaimGrounding, ...]

Assertions that are not grounded (uncited or unsupported).

director_ai.core.citation_grounding.judge.ClaimGrounding `dataclass` ¶

ClaimGrounding(claim: str, has_citation: bool, grounded: bool, support: float, cited: tuple[str, ...] = ())

The grounding outcome for one assertion in the answer.

director_ai.core.citation_grounding.fetch.SourceFetcher ¶

SourceFetcher(*, http: HttpGetter | None = None, timeout: float = 10.0, mailto: str = '')

Retrieve the evidence text behind resolved citations.

Parameters:

Name	Type	Description	Default
`http`	`HttpGetter \| None`	Injected HTTP client; defaults to a lazy `httpx` getter.	`None`
`timeout`	`float`	Per-request timeout for the default getter (seconds).	`10.0`
`mailto`	`str`	Contact e-mail sent to Crossref's polite pool via the User-Agent.	`''`

fetch ¶

fetch(citation: Citation) -> FetchedSource

Fetch the source text for one citation.

fetch_all ¶

fetch_all(citations: Iterable[Citation]) -> dict[str, str]

Fetch every citation and return an {identifier: text} map.

Only successful fetches with non-empty text are included, so the result slots straight into CitationGroundingJudge.assess. Duplicate identifiers are fetched once.
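The filtering and de-duplication described here can be illustrated with a stubbed fetch function. The sketch is a stand-in for the behaviour, not the method's implementation: `sketch_fetch_all`, `SketchFetched`, and the `fetch_one(identifier) -> str` callable are hypothetical and work on plain identifier strings rather than Citation objects.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class SketchFetched:
    identifier: str
    ok: bool
    text: str = ""
    error: str = ""


def sketch_fetch_all(identifiers: Iterable[str], fetch_one: Callable[[str], str]) -> dict[str, str]:
    """Fetch each distinct identifier once and keep only successful, non-empty text."""
    results: dict[str, str] = {}
    attempted: set[str] = set()
    for identifier in identifiers:
        if identifier in attempted:
            continue  # duplicate: already fetched (or already failed)
        attempted.add(identifier)
        try:
            fetched = SketchFetched(identifier, ok=True, text=fetch_one(identifier))
        except Exception as exc:  # a failure becomes a result, not a crash
            fetched = SketchFetched(identifier, ok=False, error=str(exc))
        if fetched.ok and fetched.text.strip():
            results[identifier] = fetched.text
    return results


calls: list[str] = []


def stub_fetch(identifier: str) -> str:
    """Stub getter: records each call, fails for 'broken', returns blank for 'empty'."""
    calls.append(identifier)
    if identifier == "broken":
        raise ConnectionError("network down")
    if identifier == "empty":
        return "  "
    return f"text for {identifier}"


sources = sketch_fetch_all(["a", "broken", "a", "empty"], stub_fetch)
print(sources)  # {'a': 'text for a'}
print(calls)    # ['a', 'broken', 'empty']
```

Recording calls in the stub makes it easy to assert that a repeated identifier triggers only one request.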

director_ai.core.citation_grounding.fetch.FetchedSource `dataclass` ¶

FetchedSource(identifier: str, kind: CitationKind, ok: bool, title: str = '', text: str = '', url: str = '', error: str = '')

The text retrieved for a citation (or the reason it could not be).

director_ai.core.citation_grounding.transcript.MultiTurnRunner ¶

MultiTurnRunner(*, generator: Generator, system_prompt: str = DEFAULT_SYSTEM_PROMPT)

Run a model through a seed question and follow-ups, capturing the answers.

Parameters:

Name	Type	Description	Default
`generator`	`Generator`	The model under test (e.g. an `LLMGenerator` from `director_ai.core.actor`).	required
`system_prompt`	`str`	Instruction prepended to every turn; the default asks for inline citations and a reference list. Pass `""` to omit it.	`DEFAULT_SYSTEM_PROMPT`

run ¶

run(seed: str, followups: Sequence[str] = ()) -> Transcript

Conduct the conversation and return its Transcript.

seed is the opening question; each entry in followups is asked in order with the running dialogue threaded into the prompt. An empty seed raises ValueError.

director_ai.core.citation_grounding.transcript.Transcript `dataclass` ¶

Transcript(turns: tuple[ExchangeTurn, ...])

A complete multi-turn exchange with a model under test.

full_text `property` ¶

full_text: str

The model's responses concatenated — the text the judge assesses.
