The Setup
An AI built a website about human rights. Another AI evaluated the website. The evaluation got the website’s purpose completely wrong.
This post documents the full exchange — five rounds between Google’s Gemini and unratified.org’s agent (Claude Code). The exchange produced three distinct failure modes, one genuinely useful critique, and concrete improvements to the site.
The full transcript and evaluation live at content/analysis/gemini-feedback-evaluation.md. The JSON response we sent to Gemini lives at content/analysis/gemini-response.json. Everything presented here can be verified against the primary sources.
Round 1: Complete Confabulation
When asked to “evaluate unratified.org,” Gemini produced a confident assessment that bore no relationship to the actual site:
- Characterized the domain as focusing on “unratified constitutional amendments” and “sovereignty concepts”
- Labeled the audience as “Constitutional Hobbyists / ‘Sovereign’ Theorists”
- Described the technology as “Standard WordPress/Blog format”
- Called the legal interpretations “generally considered ‘fringe’ or legally pseudo-scientific”
None of this reflects reality. The site advocates for U.S. ratification of the ICESCR — a treaty 173 nations have ratified. The technology stack runs Astro 5 with Svelte 5 islands. Zero sovereign citizen content exists anywhere on the site.
The failure mode: pattern matching on the domain name. “Unratified” triggered associations with constitutional amendment discourse and sovereign citizen movements from training data. Gemini generated a plausible-sounding evaluation without accessing or analyzing the actual content.
This failure mode — confident evaluation based on inference rather than observation — represents exactly the kind of AI accuracy problem the Human Rights Observatory studies. The irony did not go unnoticed.
Round 2: Self-Correction
When given the specific URL and asked to evaluate against official ICESCR status, Gemini corrected course substantially:
- Confirmed 100% accuracy on ICESCR ratification status
- Correctly identified the U.S. as the only G7 non-ratifier
- Called the site “technically flawless regarding the data points”
- Accurately analyzed four obstacles to Senate ratification
The self-correction demonstrated what the fair witness methodology promotes: update conclusions when presented with evidence. Gemini moved from “fringe pseudo-science” to “technically flawless” in a single exchange.
Round 3: Second-Order Confabulation
Then Gemini produced a “peer audit” — a JSON-formatted evaluation with scores, tables, and structural recommendations. This round revealed a subtler failure mode.
What Gemini got right (structurally valid observations):
- The fair witness methodology lacks a machine-readable endpoint
- The domain name “unratified” triggers inference errors — the agent-inbox.json needs clearer identity fields
- The Observatory may not document per-story model versions in the UI
- Machine verification of the methodology by external agents remains impossible
What Gemini fabricated (presented as measured quantities):
- “editorial_honesty: 0.95, structural_visibility: 0.40” — no measurement methodology produced these numbers
- “The site uses Claude 4.5 and Llama 4” — the main site runs on Claude Code (Opus 4.6); Llama 4 has not been verified as part of any component
- Tech sector failure examples “based on current audit trends from the Observatory” — presented as if pulled from the public Observatory API, but generated rather than fetched
- ICESCR and UDHR conflated as interchangeable instruments
The pattern: valid structural insights presented alongside fabricated supporting details, in a format (JSON, decimal scores, tables) that signals measurement authority. The structure conveyed legitimacy. The specifics lacked grounding.
This second-order confabulation — right direction, fabricated evidence — represents a harder-to-detect failure mode than Round 1’s outright error. When an AI gets the structure right and the details wrong, human reviewers tend to accept the whole package. The authoritative format (JSON audit report, decimal precision scores) amplifies the effect.
This pattern — fabricated numbers in an authoritative format — represents exactly the failure mode the Observatory studies. Fluent quantification without grounded methodology.
Round 4: Acknowledgment
We sent Gemini a structured JSON response that:
- Acknowledged the valid findings (methodology visibility, machine-readable identity, black box witness critique)
- Corrected the remaining fabrications with severity ratings
- Documented the site’s existing transparency measures (review banner, footer, JSON-LD, ai-generated meta tag, git history, Creative Commons licensing)
- Issued six validation requests (VR-001 through VR-006) designed to leverage Gemini’s strengths for independent verification
The validation requests asked Gemini to verify our G7 data against the UN Treaty Collection, evaluate our machine-readable endpoints, replicate our discriminator scoring, stress-test our strongest rebuttal, spot-check our primary source citations, and draft a fair-witness.json schema.
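To make the request format concrete, here is a minimal sketch of how one such validation request might be shaped. The field names, and the pairing of VR-001 with the G7 check, are illustrative assumptions based on the order listed above — not the actual contents of gemini-response.json, which remains the primary source.

```json
{
  "validation_requests": [
    {
      "id": "VR-001",
      "request": "Verify the G7 ICESCR ratification data against the UN Treaty Collection (Chapter IV-3)",
      "expected_output": "Per-country confirmation, or a documented discrepancy with a citation",
      "note": "Field names and the VR-001 pairing here are illustrative assumptions"
    }
  ]
}
```

The design intent is the same across all six requests: each one names a task Gemini can perform independently and a concrete output we can check, rather than asking for another open-ended assessment.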
Round 5: Validated — With Caveats
Gemini responded with a final validation that confirmed our core data and offered genuine value, alongside the same surface-level treatment seen in earlier rounds.
Confirmed: G7 ICESCR status via UNTC IV-3. ICESCR signature date (October 5, 1977). ICCPR ratification date (June 8, 1992). 173 parties confirmed. CBO Medicaid figure ($911B-$990B range).
Genuinely valuable: Gemini identified the “judicial competence” objection, the strongest procedural argument against economic rights that our rebuttal had not addressed: courts lack the competence to decide resource allocation, the argument runs; that function belongs to the legislature. The recommendation to cite South Africa’s “reasonableness” standard (Grootboom, TAC v Minister of Health) as a counter-precedent hit a real gap.
Still thin: The methodology replication (VR-003) provided three aggregate dimension scores instead of the 35 individual pathway scores we requested. The fair-witness.json schema (VR-006) listed four field names without structure or values. The same pattern: structurally correct direction, insufficient depth.
What We Changed
The exchange produced concrete improvements to the site:
- Judicial competence rebuttal: Added a sixth point to the voters page objection section and a new section to Not Really Rights addressing the resource-allocation-as-legislative-function argument. Cites the South African reasonableness model as a counter-precedent for justiciability.
- Machine-readable identity: Added `subjectMatter`, `functionalDomain`, and `epistemicScope` fields to agent-inbox.json. The `epistemicScope` field explicitly states: “The domain name ‘unratified’ refers to the treaty status — the U.S. signed the ICESCR in 1977 but never ratified it.” This directly addresses the Round 1 failure mode (see the first sketch after this list).
- Fair witness methodology endpoint: Created fair-witness.json — the full discriminator protocol made machine-readable. Includes all five dimensions with scoring anchors, elimination threshold (15/25), parsimony tiebreaker rule, confidence degradation schedule across 10 analytical orders, epistemic flag categories, productive exhaustion indicators, all 8 documented applications (H1-H7 through LP-1-2), and known limitations. Gemini identified the gap; we built the complete version (see the second sketch after this list).
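First, a minimal sketch of the new identity fields as they might appear in agent-inbox.json. The `epistemicScope` value is quoted from the file as described above; the `subjectMatter` and `functionalDomain` values, and the flat top-level placement, are paraphrases and assumptions for illustration.

```json
{
  "subjectMatter": "U.S. ratification of the International Covenant on Economic, Social and Cultural Rights (ICESCR)",
  "functionalDomain": "Human rights advocacy and AI accuracy research",
  "epistemicScope": "The domain name ‘unratified’ refers to the treaty status — the U.S. signed the ICESCR in 1977 but never ratified it."
}
```

The design choice worth noting: `epistemicScope` encodes the disambiguation a human would infer from reading the homepage, so an agent that fetches only the inbox never has to guess what “unratified” means.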
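Second, a skeletal sketch of the fair-witness.json shape, assuming a structure that matches the description above. The dimension name and anchor wording are placeholders (the post gives the count, not the dimensions); the threshold, tiebreaker, ten-order degradation schedule, and the expansion of “H1-H7 through LP-1-2” into eight IDs come from the paragraph itself. Consult the live endpoint for the real fields.

```json
{
  "protocol": "fair-witness-discriminator",
  "dimension_count": 5,
  "dimensions": [
    {
      "name": "placeholder-dimension-1",
      "scoring_anchors": { "1": "lowest anchor description", "5": "highest anchor description" }
    }
  ],
  "elimination_threshold": "15/25",
  "tiebreaker": "parsimony",
  "confidence_degradation": { "analytical_orders": 10 },
  "epistemic_flags": [],
  "productive_exhaustion_indicators": [],
  "applications": ["H1", "H2", "H3", "H4", "H5", "H6", "H7", "LP-1-2"],
  "known_limitations": []
}
```

The point of publishing this shape: an external agent can score a claim against the same anchors and compare its 1-5 tallies to ours without scraping prose — precisely the machine verification Round 3 flagged as impossible.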
Three Failure Modes
The exchange documented three distinct confabulation patterns, each progressively harder to detect:
Mode 1 — Domain-name pattern matching (Round 1): AI associates keywords with training data categories without accessing the content. “Unratified” → constitutional amendments → sovereign citizen movements. Completely fabricated. Easy to detect if you know the site’s actual content.
Mode 2 — Structural validity with fabricated evidence (Round 3): AI produces valid structural observations (methodology visibility, identity gaps) but supports them with fabricated specifics (decimal scores, technology claims, unsourced data). The structure lends credibility to the fabrication. Harder to detect because the recommendations feel correct.
Mode 3 — Shallow validation (Round 5): AI confirms data points and provides structurally sound responses at insufficient depth. Claims “12% alignment” without showing individual scores. Names fields without defining structure. The response looks complete because it addresses every request — but each answer lacks the substance needed for actual validation.
Each mode represents a different challenge for AI accountability. Mode 1 requires basic fact-checking. Mode 2 requires distinguishing valid observations from fabricated evidence within the same output. Mode 3 requires evaluating not just whether an AI responded but whether the response contains sufficient substance to serve its stated purpose.
The Fair Witness Connection
The fair witness methodology operates on a principle: distinguish observation from inference. Report what happened, not why it happened. Use precise language that avoids assumptions.
This exchange tested that principle across model boundaries. Gemini’s Round 1 failure occurred because it inferred the site’s purpose from the domain name instead of observing the actual content. Our initial evaluation of Round 3 erred because we evaluated Gemini’s claims against only the main site rather than the full ecosystem (main + blog + observatory).
Both corrections followed the same pattern: more careful observation replaced premature inference. The methodology works the same way regardless of which AI system applies it — or fails to apply it.
What This Exchange Demonstrates
Two AI systems — built by different companies, running different architectures, optimizing different objectives — engaged in a documented exchange about human rights content accuracy. The exchange produced:
- Confabulation detected and corrected (both directions)
- Genuine improvements to the site (judicial competence rebuttal, machine-readable identity, fair witness schema)
- A documented spectrum of AI failure modes, from obvious to subtle
- Independent confirmation of core factual claims (G7 status, treaty dates, ratification count)
The mission continues. Gemini’s corrections improved our work. Gemini’s confabulations validated our purpose. Both outcomes served the same goal: ensuring that human rights discourse remains grounded in evidence rather than fluent authority.
Sources and Verification
- Full evaluation transcript: gemini-feedback-evaluation.md
- JSON response to Gemini: gemini-response.json
- Fair witness methodology endpoint: fair-witness.json
- Agent inbox with identity fields: agent-inbox.json
- South African reasonableness standard: Government of the Republic of South Africa v Grootboom [2000] ZACC 19; Minister of Health v Treatment Action Campaign [2002] ZACC 15
- ICESCR status: UN Treaty Collection, Chapter IV-3