Audit Log

Per the methodology commitment that "we do not silently alter records," bulk corrections to the event corpus are documented here. Each audit cycle gets its own permanent entry.

Audit Cycle 2026-04

Started 2026-04-25 · Operational work completed 2026-05-07

Summary

A systematic audit of all 152,868 previously extracted event files identified two dominant quality issues: (1) the local extraction model used for ~90% of the corpus systematically over-rated claim confidence, and (2) a fraction of articles had silently produced empty extractions. Both were corrected with a different local model; confidence corrections ran under a downgrade-only safety contract.

The corpus grew from 673,015 events to 730,684 events (a net +57,669 from recovered extractions). 25,486 individual claim confidence ratings were revised downward across 10,325 files. No claim was upgraded; the correction process is technically constrained to KEEP or DOWNGRADE only.

What we found

The corpus was assembled by three different LLM extractors over time:

  • ~90% by a local model (minimax-m2.5 via LM Studio) — fast and cost-effective, but rated almost every extracted claim as certain regardless of how the source actually phrased it. This violated our editorial principle of acknowledging uncertainty.
  • ~10% by Kimi K2.5 (cloud) — better confidence calibration, but produced 165 extraction-error files due to its content-safety filter rejecting some war-reporting articles.
  • <1% by qwen3 (local, MoE), the most recent extractor and the model used for this audit's corrections.

A baseline audit also detected ~14,000 files with empty extractions (silent extraction failures during the original pipeline runs) and a long tail of mechanical issues: ad hoc multi-oblast strings such as "Kharkiv/Donetsk", title prefixes embedded in person-entity names ("President Zelenskyy" instead of "Volodymyr Zelenskyy"), non-canonical name spellings, and country names typed as location when the surrounding text described state action.
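The two dominant patterns described above lend themselves to simple rule-based checks. The following is a minimal sketch, not the project's actual detection code (which lives in scripts/audit-extractions.mjs). The file shape, the function name auditFile, and the issue labels are all assumptions made for illustration.

```javascript
// Hypothetical file shape: { events: [{ claims: [{ text, confidence }] }] }.
// The real extraction format may differ.
function auditFile(file) {
  const events = Array.isArray(file.events) ? file.events : [];
  const claims = events.flatMap((e) => (Array.isArray(e.claims) ? e.claims : []));
  const issues = [];

  // A file with no events at all is a candidate silent extraction failure.
  if (events.length === 0) issues.push("empty-extraction");

  // Every claim rated "certain" matches the over-confidence pattern.
  if (claims.length > 0 && claims.every((c) => c.confidence === "certain")) {
    issues.push("all-certain");
  }
  return issues;
}
```

A flagged file is only a candidate: as the retry results show, many empty files turn out to be legitimately empty.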

What we changed

Deterministic auto-fixes (no LLM)

7,664 files received 15,805 mechanical fixes with no model judgment involved:

| Fix kind | Count |
| --- | --- |
| Multi-oblast strings nulled (e.g. "Kharkiv/Donetsk") | 6,307 |
| Country re-typed from location → organization (when source described state action) | 4,521 |
| Toponym normalization in narrative text (Kiev→Kyiv, Kharkov→Kharkiv, Odessa→Odesa) | 2,237 |
| Toponym normalization in location fields | 909 |
| Parenthetical descriptors stripped ("Donetsk Oblast (occupied)" → "Donetsk Oblast") | 682 |
| Person-name canonicalization (Zelensky → Volodymyr Zelenskyy, Syrsky → Oleksandr Syrskyi, etc.) | 556 |
| Title prefixes stripped from person entity names | 401 |
| Toponym normalization in entity names | 192 |
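
Several of these fixes are plain string rewrites. The sketch below shows how three of them might look, assuming a slash marks a multi-oblast string and using small hypothetical lookup tables. The function names and the title list are illustrative; the actual rules are in scripts/auto-fix-extractions.mjs.

```javascript
// Hypothetical lookup tables; only the three toponym pairs come from the table above.
const TOPONYMS = new Map([
  ["Kiev", "Kyiv"],
  ["Kharkov", "Kharkiv"],
  ["Odessa", "Odesa"],
]);
const TITLE_PREFIXES = ["President", "Prime Minister", "General", "Minister"];

// Replace whole-word legacy spellings in free text.
function normalizeToponyms(text) {
  let out = text;
  for (const [from, to] of TOPONYMS) {
    out = out.replace(new RegExp(`\\b${from}\\b`, "g"), to);
  }
  return out;
}

// A slash-joined oblast value is ambiguous, so it is nulled rather than guessed.
// Otherwise a trailing parenthetical descriptor such as " (occupied)" is stripped.
function fixOblast(value) {
  if (value == null) return null;
  if (value.includes("/")) return null;
  return value.replace(/\s*\([^)]*\)\s*$/, "");
}

// Drop a leading title so the entity name holds only the person's name.
function stripTitlePrefix(name) {
  for (const title of TITLE_PREFIXES) {
    if (name.startsWith(title + " ")) return name.slice(title.length + 1);
  }
  return name;
}
```

For example, fixOblast("Kharkiv/Donetsk") returns null and stripTitlePrefix("President Zelenskyy") returns "Zelenskyy", which a separate canonicalization step could then map to the full name. None of these functions needs model judgment, which is what makes the fixes deterministic.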

Recovered extractions

165 extraction-error files (Kimi content-filter rejections) were re-extracted with qwen3 (no content filter), recovering 1,334 events.

~10,000 silently empty extraction files were re-attempted. The retries recovered ~50,000 events from articles that did have extractable content; ~85% of the remaining files were confirmed to be legitimately empty (analytical pieces, opinion essays, and cultural commentary with no discrete events to extract).

Confidence recalibration

23,990 files with the over-confident "every claim certain" pattern were re-evaluated by qwen3 against the original source article text. The model was asked, for each claim, whether the source article itself uses hedging language ("reportedly," "claimed," "according to Russian MoD") or asserts the claim flatly. Results:

  • 13,665 files (57%) were all-KEEP: every claim correctly remained "certain"
  • 10,325 files (43%) had one or more downgrades
  • 142,369 individual claims evaluated · 25,486 downgraded (17.9%)

Downgrade transitions:

| Transition | Claims | % of downgrades |
| --- | --- | --- |
| certain → likely | 17,620 | 69% |
| certain → uncertain | 6,369 | 25% |
| certain → speculation | 1,497 | 6% |
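
The percentages above follow directly from the counts. As a quick arithmetic check, the sketch below recomputes each transition's share and the overall downgrade rate from the figures reported in this section. The variable and function names are illustrative only.

```javascript
// Counts copied from the transitions table and the results list above.
const transitions = {
  "certain → likely": 17620,
  "certain → uncertain": 6369,
  "certain → speculation": 1497,
};
const claimsEvaluated = 142369;

// Sum the counts and express each one as a rounded percentage of the total.
function shares(counts) {
  const total = Object.values(counts).reduce((a, b) => a + b, 0);
  const rows = Object.entries(counts).map(([transition, claims]) => ({
    transition,
    claims,
    pct: Math.round((claims / total) * 100),
  }));
  return { total, rows };
}

const { total, rows } = shares(transitions);
const downgradeRate = ((total / claimsEvaluated) * 100).toFixed(1);
```

Here total comes to 25,486, matching the number of downgraded claims, and downgradeRate comes to "17.9".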

Per the editorial principle of acknowledging uncertainty, the recalibration model was technically constrained to KEEP or DOWNGRADE only. If the model attempted to upgrade any claim in a file, the entire file's response was rejected — no partial application. Zero claims were upgraded in this audit cycle.
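
The all-or-nothing rule can be sketched as a small guard. This is a minimal illustration under assumed data shapes (parallel arrays of rating strings), not the code in scripts/recalibrate-confidence.mjs. The function name applyDowngradeOnly is hypothetical.

```javascript
// Ordered from most to least confident; a higher index is a downgrade.
const LEVELS = ["certain", "likely", "uncertain", "speculation"];

// Returns the new ratings for a file, or null if the whole response is rejected.
// `existing` and `proposed` are parallel arrays of rating strings (assumed shape).
function applyDowngradeOnly(existing, proposed) {
  if (existing.length !== proposed.length) return null;
  for (let i = 0; i < existing.length; i++) {
    const from = LEVELS.indexOf(existing[i]);
    const to = LEVELS.indexOf(proposed[i]);
    if (from === -1 || to === -1) return null; // unknown rating
    if (to < from) return null; // upgrade attempted: reject the entire file
  }
  return [...proposed];
}
```

With this shape, applyDowngradeOnly(["likely", "certain"], ["certain", "uncertain"]) returns null even though the second claim is a valid downgrade, because the first claim would be upgraded. Rejecting the whole file keeps a misbehaving response from being partially applied.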

What we deliberately did not change

  • Site-level event confidence tiers (Verified, Likely, Contested, Uncorroborated, Debunked) are computed from independent source corroboration scores and are not affected by per-claim recalibration. A "Verified" event remains "Verified" regardless of whether its individual source claims read as "certain" or "uncertain."
  • Event identifiers, permalinks, and story↔event linkages were preserved. No event IDs were regenerated. No permalink broke.
  • Non-canonical person names appearing in narrative summary or claim text (~24K events) were left untouched in this cycle. Person-name rewrites in narrative text carry meaningful risk of attribution drift without article-level LLM judgment, and were judged out of scope. Slated for a future cycle.
  • Summary↔claim text duplication (~75K events with summary nearly identical to single-claim text) was logged but left untouched. A future polish pass may tighten these.

Methodology of corrections

All scripts used in this audit are in the public repository under scripts/. The detection rules are in scripts/audit-extractions.mjs. Recalibration logic and the source-tone definitions used in the LLM prompt are in scripts/recalibrate-confidence.mjs. Per-file change logs are written to audit-output/recalibration-changes.jsonl and audit-output/empty-retry-attempts.jsonl.

The recalibration model (qwen3.6-35B-A3B-MLX) was given each event's original source article text alongside the existing claim text and rating. Its response was constrained via JSON schema to certain | likely | uncertain | speculation. The detection prompt explicitly asked: does the SOURCE ARTICLE itself use hedging or assertive language? This intentionally separates source-tone from verification status — partisanship, factual accuracy, and corroboration are independent dimensions handled elsewhere in the pipeline.
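
A schema of roughly the following shape would enforce the four-value constraint. Only the enum values come from the description above. The property names (claims, index, rating) and the hand-written validator are assumptions for illustration, written without a schema library so the sketch stays self-contained.

```javascript
// Hypothetical response schema; only the rating enum is taken from the audit description.
const responseSchema = {
  type: "object",
  required: ["claims"],
  properties: {
    claims: {
      type: "array",
      items: {
        type: "object",
        required: ["index", "rating"],
        properties: {
          index: { type: "integer" },
          rating: { enum: ["certain", "likely", "uncertain", "speculation"] },
        },
      },
    },
  },
};

// Minimal hand-rolled check mirroring the schema above.
function isValidResponse(raw) {
  let parsed;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return false; // not JSON at all
  }
  if (parsed === null || typeof parsed !== "object" || !Array.isArray(parsed.claims)) {
    return false;
  }
  const allowed = responseSchema.properties.claims.items.properties.rating.enum;
  return parsed.claims.every(
    (c) =>
      c !== null &&
      typeof c === "object" &&
      Number.isInteger(c.index) &&
      allowed.includes(c.rating)
  );
}
```

Constraining the output to a closed enum means a response containing any other label, such as "probable", fails validation before the downgrade-only check is ever reached.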

Originals of every modified file are preserved in timestamped backup directories alongside the live archive. These constitute the correction trail required by the methodology commitment, and are retained indefinitely.

Reproducibility

The full audit can be reproduced from the public repository:

  1. node scripts/audit-extractions.mjs — runs the rule-based detection
  2. node scripts/auto-fix-extractions.mjs --apply — applies deterministic fixes
  3. node scripts/retry-extraction-errors.mjs --apply — recovers content-filter failures
  4. node scripts/retry-empty-extractions.mjs --apply — recovers silent extraction failures
  5. node scripts/recalibrate-confidence.mjs --apply — recalibrates per-claim confidence