Audit Log | Ukraine Truth

Per the methodology commitment that "we do not silently alter records," bulk corrections to the event corpus are documented here. Each audit cycle gets its own permanent entry.

Oblast Spellings Merged

2026-07-16

Summary

Our extraction records each event's oblast exactly as the source reported it, so the same region reaches us under many spellings. Donetsk arrived as "Donetsk Oblast" (7,562 events), "Donetsk" (1,318), "Donetsk region" (38), "Donetsk oblast" (4) and "Donetsk Region" (1). Until now the published dataset at /data/timeline-meta.json counted each spelling as a separate place. Every region was undercounted, the same region could appear twice in one ranking, and the regional ordering was wrong. We now merge spelling variants before publishing.

This affected the published dataset only. It did not affect any event page, any confidence tier, or any event's recorded location.

What we changed

Merged spelling, casing and suffix variants of the same place. Donetsk's five spellings become one entry of 8,923. Zaporizhzhia had eleven spellings (including the "Zaporizhia" and "Zaporizhzhya" transliterations) and becomes 6,699. Dnipropetrovsk had thirteen (including "Dnipro Oblast" and the Ukrainian adjectival "Dnipropetrovska") and becomes 4,668.
Two regions were published in the wrong order. Kherson previously ranked 3rd (4,964) and Zaporizhzhia 4th (4,862). Merged correctly, Zaporizhzhia ranks 3rd (6,699) ahead of Kherson (5,967). Dnipropetrovsk rises from 8th to 5th.
"Ukraine" is no longer ranked as a region. It appeared 6th with 3,813 events, above oblasts it contains. In a field that names a region within Ukraine, the value "Ukraine" records that the region was not determined, so those events are now grouped under "Unknown" rather than presented as a place.
Kyiv city and Kyiv Oblast are kept separate (757 and 4,005). They are different places, and merging the capital into the surrounding oblast would repeat the error being corrected here.
The distinct entry count falls from 4,272 to 3,922. The remainder is a long tail of one-off values, mostly strings naming several regions at once, which we leave exactly as reported.
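The merge described above can be sketched in a few lines. This is a minimal illustration, not the site's actual normalizer: the alias table, the function names `canonicalOblast` and `mergeCounts`, and the choice of "Donetsk Oblast" as the canonical label are assumptions, and only the Donetsk spellings quoted in this entry are listed.

```javascript
// Hypothetical alias table: lowercased variant -> canonical label.
const CANONICAL = new Map([
  ["donetsk", "Donetsk Oblast"],
  ["donetsk oblast", "Donetsk Oblast"],
  ["donetsk region", "Donetsk Oblast"],
]);

// Values that record "region not determined" rather than a place.
const UNDETERMINED = new Set(["ukraine"]);

function canonicalOblast(raw) {
  // Fold casing and repeated whitespace before looking up the alias.
  const key = raw.trim().toLowerCase().replace(/\s+/g, " ");
  if (UNDETERMINED.has(key)) return "Unknown";
  // Unlisted values (the long tail) pass through exactly as reported.
  return CANONICAL.get(key) ?? raw;
}

function mergeCounts(rawCounts) {
  const merged = new Map();
  for (const [raw, count] of Object.entries(rawCounts)) {
    const label = canonicalOblast(raw);
    merged.set(label, (merged.get(label) ?? 0) + count);
  }
  return merged;
}

// Figures quoted in this entry, used as illustrative input.
const merged = mergeCounts({
  "Donetsk Oblast": 7562,
  "Donetsk": 1318,
  "Donetsk region": 38,
  "Donetsk oblast": 4,
  "Donetsk Region": 1,
  "Kyiv": 757,
  "Kyiv Oblast": 4005,
});
console.log(merged.get("Donetsk Oblast")); // 8923
```

Because "Kyiv" has no alias in the table, it stays separate from "Kyiv Oblast". The merge only adds counts together, so the sum over all entries is the same before and after.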

What we deliberately did not change

No records were altered. Every event keeps the oblast its source reported, and the raw spellings are retained upstream in our build data. This is a change to how the published dataset groups those values for display, not a rewrite of what any source said.
No events were dropped and no totals moved. The dataset attributes 97,495 oblast mentions before and after. This is a merge, never a filter.
Nothing was bucketed away. We considered collapsing everything outside Ukraine's oblasts into a single "Other" entry and rejected it: that would have hidden real and significant geography behind a label, including Kursk Oblast (2,136 events), Belgorod Oblast (1,480) and the Black Sea (888). Those keep their own entries.
Event pages, corroboration tiers, identifiers and permalinks are untouched.
One inconsistency remains, and we are naming it rather than quietly fixing it. The year timeline pages and the research dashboard use an older, separate normalizer that merges Kyiv city into Kyiv Oblast. Their published figures are unchanged by this entry, so they will not match the dataset above for Kyiv. Reconciling them moves numbers on those pages too, so it will be its own documented change rather than a silent one folded into this correction.

Source Tone Removed

2026-06-19

Summary

We removed "source tone" (the per-claim certain / likely / uncertain / speculation rating) from the site end to end. It was a low-value signal: local extraction models flatten ordinary declarative news prose to "certain" regardless of how a source actually phrases a claim, which is why an entire nightly apparatus existed only to re-rate and police it. It was also easily confused with our real confidence signal, event corroboration. The site now presents a single confidence model.

What we changed

Stopped extracting and storing per-claim source tone. New events carry no tone field, and it is gone from event pages, the Event Explorer filter, the CSV export, and the research analytics.
Retired the nightly recalibration machinery (the audit, recalibrate, and calibration-guard steps) that existed only to manage tone flatlining. The nightly run is shorter and no longer makes that extra model pass.
Reworked the source-performance score on the Sources page. Its tone-derived "contradiction" term is gone; the score is now 0.70 corroboration rate plus 0.30 confirmation speed. Some source scores shifted as a result.
Corrected the methodology to describe one confidence model with the three tiers we actually compute (Verified, Likely, Uncorroborated). Contested and Debunked are now honestly described as reserved for manual editorial review, not automatically detected, and currently unused.
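As a worked illustration of the reworked source-performance score, the sketch below applies the two stated weights. It assumes both inputs are already normalized to the range 0 to 1; the function name `sourceScore` and that normalization are assumptions, not a description of the production code.

```javascript
// Weights as stated in this entry.
const WEIGHTS = { corroboration: 0.7, speed: 0.3 };

// Assumption: both inputs are normalized to the range 0..1.
function sourceScore(corroborationRate, confirmationSpeed) {
  for (const v of [corroborationRate, confirmationSpeed]) {
    if (!(v >= 0 && v <= 1)) throw new RangeError("inputs must be in [0, 1]");
  }
  return (
    WEIGHTS.corroboration * corroborationRate +
    WEIGHTS.speed * confirmationSpeed
  );
}

// 0.70 * 0.8 + 0.30 * 0.5 = 0.71
console.log(sourceScore(0.8, 0.5).toFixed(2)); // "0.71"
```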

What we deliberately did not change

No records were silently altered. The 2026-06 recalibration entry below stays exactly as published. This is a forward-looking change to what we store and show, not a rewrite of history, and the original source archives retain their data.
Event corroboration is untouched. Site-level confidence tiers are still computed from independent source agreement, the same way as before.
Event identifiers, permalinks, and story links were preserved.

Audit Cycle 2026-06

Started 2026-06-02 · Completed 2026-06-04

Summary

After the April 2026 audit, ingestion continued: notably the UNIAN and Censor.NET archives, the historical sitemap backfills for several sources, and the ongoing multi-source sync. Those newer extractions were produced by local models that default per-claim confidence to certain regardless of how the source actually phrases it. Because recalibration had been a one-off audit pass rather than part of ingestion, none of them were re-rated. A backlog of 37,445 over-confident extraction files accumulated.

This cycle re-evaluated all of them against their original source articles and revised 35,827 individual claim confidence ratings downward across 20,061 events. As in April, the process is technically constrained to KEEP or DOWNGRADE only. No claim was upgraded. The catch-up was detected and carried out automatically by the new nightly sync routine, not by hand.

What we found

The April audit corrected the corpus as it existed then, but confidence calibration was a manual, one-time pass that was never wired into ingestion. So every source added or backfilled afterward re-introduced the same over-confidence pattern:

The Censor.NET and UNIAN archives, the Kyiv Independent and Ukrainska Pravda backfills, and the ongoing multi-source sync all inherited the flatline-to-certain behavior of the local extraction models.
None of these had been re-rated against source tone, leaving 37,445 flatlined files across the corpus.

The new nightly routine's audit step (scripts/audit-extractions.mjs) detected this backlog automatically on its first full runs, and the recalibration pass cleared it before publishing.

What we changed

Confidence recalibration

37,445 files were re-evaluated by qwen3.6-35B-A3B-MLX against each event's original source article. For every claim the model was asked whether the source itself uses hedging language ("reportedly," "claimed," "according to the Russian MoD") or asserts the claim flatly. Results:

17,384 files (46%) all-KEEP, with every claim correctly remaining "certain"
20,061 files (54%) had one or more downgrades
222,203 individual claims evaluated, of which 35,827 were downgraded (16.1%)

Downgrade transitions:

Transition	Claims	% of downgrades
certain → likely	23,618	65.9%
certain → uncertain	9,637	26.9%
certain → speculation	2,572	7.2%

The most affected sources were Censor.NET, UNIAN, Kyiv Independent, and Ukrainska Pravda, which together accounted for nearly all of the backlog. Before recalibration these files were flatlined to almost entirely "certain" (the pattern that flagged them). Afterward, the recalibrated extractions measured 69.3% certain, 20.3% likely, 8.2% uncertain, and 2.2% speculation, close to the healthy-corpus baseline of 67.7 / 23.4 / 8.2 / 0.7.

What we deliberately did not change

Site-level event confidence tiers (Verified, Likely, Contested, Uncorroborated, Debunked) are computed from independent source corroboration and are not affected by per-claim source-tone recalibration. A "Verified" event stays "Verified" regardless of whether its source claims read as "certain" or "uncertain".
Event identifiers, permalinks, and story-to-event linkages were preserved. No IDs were regenerated and no permalink broke.
No upgrades. The model could only KEEP or DOWNGRADE. If it attempted to upgrade any claim, the entire file's response was rejected, with no partial application. Zero claims were upgraded in this cycle.
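The KEEP-or-DOWNGRADE contract can be expressed as an all-or-nothing check over a file's claims. The sketch below is one way to write it, assuming ratings arrive as parallel arrays in claim order; the function name `applyDowngradeOnly` and that data shape are hypothetical and are not taken from scripts/recalibrate-confidence.mjs.

```javascript
// Ratings ordered from most to least confident.
const RATINGS = ["certain", "likely", "uncertain", "speculation"];

// Returns the new ratings if every change is a KEEP or a DOWNGRADE,
// or null if the whole file's response must be rejected.
function applyDowngradeOnly(current, proposed) {
  if (current.length !== proposed.length) return null;
  for (let i = 0; i < current.length; i++) {
    const before = RATINGS.indexOf(current[i]);
    const after = RATINGS.indexOf(proposed[i]);
    if (before === -1 || after === -1) return null; // unknown rating
    if (after < before) return null; // attempted upgrade: reject everything
  }
  return [...proposed];
}

console.log(applyDowngradeOnly(["certain", "likely"], ["likely", "likely"]));
// [ 'likely', 'likely' ]
console.log(applyDowngradeOnly(["certain", "likely"], ["uncertain", "certain"]));
// null (the second claim would be an upgrade, so nothing is applied)
```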

Methodology, and what is different now

Detection and recalibration used the same public scripts as April (scripts/audit-extractions.mjs and scripts/recalibrate-confidence.mjs), with the recalibration model constrained via JSON schema to certain | likely | uncertain | speculation and technically unable to upgrade. Per-file change logs are written to audit-output/recalibration-changes.jsonl, and the originals of every modified file are preserved in the timestamped backup directories archive-backup-confidence-20260603-023238 and archive-backup-confidence-20260604-061545. These constitute the correction trail required by the methodology commitment, and are retained indefinitely.

The key change from April is structural. Recalibration is now part of the nightly pipeline rather than a manual one-off. Each night's new extractions are audited and re-rated before anything is published, and a calibration guard blocks the deploy if a batch still looks over-confident. This backlog cycle was a one-time catch-up. Routine calibration is now continuous, so a backlog of this size should not build up again.
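A calibration guard of this kind could be as simple as a threshold on the share of "certain" ratings in a batch. The sketch below is an assumption about the shape of such a check, not the actual guard: the 0.90 cutoff and the function names are hypothetical, and this log does not state the real threshold.

```javascript
// Hypothetical cutoff: the log does not state the guard's real threshold.
const MAX_CERTAIN_SHARE = 0.9;

// Fraction of ratings in the batch that are "certain".
function certainShare(ratings) {
  if (ratings.length === 0) return 0;
  return ratings.filter((r) => r === "certain").length / ratings.length;
}

// True when the batch may be deployed, false when it still looks flatlined.
function guardAllowsDeploy(ratings, maxShare = MAX_CERTAIN_SHARE) {
  return certainShare(ratings) <= maxShare;
}

console.log(guardAllowsDeploy(Array(10).fill("certain"))); // false
console.log(guardAllowsDeploy(["certain", "certain", "likely", "uncertain"])); // true
```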

Audit Cycle 2026-04

Started 2026-04-25 · Operational work completed 2026-05-07

Summary

A systematic audit of all 152,868 previously extracted event files identified two dominant quality issues: (1) the local extraction model used for ~90% of the corpus systematically over-rated claim confidence, and (2) a fraction of articles had silently produced empty extractions. Both have been corrected using a different local model, with a downgrade-only safety contract for confidence corrections.

The corpus grew from 673,015 events to 730,684 events (a net +57,669 from recovered extractions). 25,486 individual claim confidence ratings were revised downward across 10,325 events. No claim was ever upgraded; the correction process is technically constrained to KEEP or DOWNGRADE only.

What we found

The corpus was assembled by three different LLM extractors over time:

~90% by a local model (minimax-m2.5 via LM Studio): fast and cost-effective, but rated almost every extracted claim as certain regardless of how the source actually phrased it. This violated our editorial principle of acknowledging uncertainty.
~10% by Kimi K2.5 (cloud): better confidence calibration, but produced 165 extraction-error files due to its content-safety filter rejecting some war-reporting articles.
<1% by qwen3 (local, MoE), the most recent extractor: used for this audit's corrections.

A baseline audit also detected ~14,000 files with empty extractions (silent extraction failures during the original pipeline runs) and a long tail of mechanical issues: ad hoc multi-oblast strings, title prefixes embedded in person-entity names ("President Zelenskyy" instead of "Volodymyr Zelenskyy"), non-canonical name spellings, and country names typed as location when the surrounding text described state action.

What we changed

Deterministic auto-fixes (no LLM)

7,664 files received 15,805 mechanical fixes with no model judgment involved:

Fix kind	Count
Multi-oblast strings nulled (e.g. "Kharkiv/Donetsk")	6,307
Country re-typed from location → organization (when source described state action)	4,521
Toponym normalization in narrative text (Kiev→Kyiv, Kharkov→Kharkiv, Odessa→Odesa)	2,237
Toponym normalization in location fields	909
Parenthetical descriptors stripped ("Donetsk Oblast (occupied)" → "Donetsk Oblast")	682
Person-name canonicalization (Zelensky → Volodymyr Zelenskyy, Syrsky → Oleksandr Syrskyi, etc.)	556
Title prefixes stripped from person entity names	401
Toponym normalization in entity names	192
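Three of the fix kinds in the table can be illustrated with plain string rules. This is a minimal sketch and not the contents of scripts/auto-fix-extractions.mjs: the function names are hypothetical, the toponym map contains only the three pairs named above, and treating "/" as the only multi-oblast separator is an assumption.

```javascript
// Only the three pairs named in the table; the real list may be longer.
const TOPONYMS = { Kiev: "Kyiv", Kharkov: "Kharkiv", Odessa: "Odesa" };

// Replace whole-word legacy spellings in free text.
function normalizeToponyms(text) {
  return text.replace(/\b(Kiev|Kharkov|Odessa)\b/g, (m) => TOPONYMS[m]);
}

function fixOblastField(value) {
  if (value == null) return null;
  // "Donetsk Oblast (occupied)" -> "Donetsk Oblast"
  const stripped = value.replace(/\s*\([^)]*\)\s*$/, "").trim();
  // "Kharkiv/Donetsk" names several regions at once, so it is nulled.
  if (stripped.includes("/")) return null;
  return normalizeToponyms(stripped);
}

console.log(fixOblastField("Donetsk Oblast (occupied)")); // "Donetsk Oblast"
console.log(fixOblastField("Kharkiv/Donetsk")); // null
console.log(normalizeToponyms("Strikes hit Kiev and Odessa")); // "Strikes hit Kyiv and Odesa"
```

Rules like these involve no model judgment, which is why they could be applied without the downgrade-only safety contract used for confidence corrections.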

Recovered extractions

165 extraction-error files (Kimi content-filter rejections) were re-extracted with qwen3 (no content filter), recovering 1,334 events.

~14,000 silently empty extraction files were re-attempted. Of those, ~56,000 events were recovered from articles that actually had content; ~85% of the remaining files were confirmed to be legitimately empty (analytical pieces, opinion essays, cultural commentary: no discrete events to extract).

Confidence recalibration

23,990 files with the over-confident "every claim certain" pattern were re-evaluated by qwen3 against the original source article text. The model was asked, for each claim, whether the source article itself uses hedging language ("reportedly," "claimed," "according to Russian MoD") or asserts the claim flatly. Results:

13,665 files (57%) all-KEEP: every claim correctly remained "certain"
10,325 files (43%) had one or more downgrades
142,369 individual claims evaluated · 25,486 downgraded (17.9%)

Downgrade transitions:

Transition	Claims	% of downgrades
certain → likely	17,620	69%
certain → uncertain	6,369	25%
certain → speculation	1,497	6%

Per the editorial principle of acknowledging uncertainty, the recalibration model was technically constrained to KEEP or DOWNGRADE only. If the model attempted to upgrade any claim in a file, the entire file's response was rejected: no partial application. Zero claims were upgraded in this audit cycle.

What we deliberately did not change

Site-level event confidence tiers (Verified, Likely, Contested, Uncorroborated, Debunked) are computed from independent source corroboration scores and are not affected by per-claim recalibration. A "Verified" event remains "Verified" regardless of whether its individual source claims read as "certain" or "uncertain."
Event identifiers, permalinks, and story↔event linkages were preserved. No event IDs were regenerated. No permalink broke.
Non-canonical person names appearing in narrative summary or claim text (~24K events) were left untouched in this cycle. Person-name rewrites in narrative text carry meaningful risk of attribution drift without article-level LLM judgment, and were judged out of scope. They are slated for a future cycle.
Summary↔claim text duplication (~75K events with summary nearly identical to single-claim text) was logged but left untouched. A future polish pass may tighten these.

Methodology of corrections

All scripts used in this audit are in the public repository under scripts/. The detection rules are in scripts/audit-extractions.mjs. Recalibration logic and the source-tone definitions used in the LLM prompt are in scripts/recalibrate-confidence.mjs. Per-file change logs are written to audit-output/recalibration-changes.jsonl and audit-output/empty-retry-attempts.jsonl.

The recalibration model (qwen3.6-35B-A3B-MLX) was given each event's original source article text alongside the existing claim text and rating. Its response was constrained via JSON schema to certain | likely | uncertain | speculation. The recalibration prompt explicitly asked: does the SOURCE ARTICLE itself use hedging or assertive language? This intentionally separates source-tone from verification status: partisanship, factual accuracy, and corroboration are independent dimensions handled elsewhere in the pipeline.
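For readers unfamiliar with schema-constrained output, the sketch below shows what an enum constraint of this kind might look like. The response shape (a `ratings` array with one rating per claim) is hypothetical; the actual schema lives in scripts/recalibrate-confidence.mjs and may differ. The hand-written `matchesSchema` check only mirrors this example schema for illustration.

```javascript
// Hypothetical response shape: one rating per claim, in claim order.
const responseSchema = {
  type: "object",
  properties: {
    ratings: {
      type: "array",
      items: { enum: ["certain", "likely", "uncertain", "speculation"] },
    },
  },
  required: ["ratings"],
  additionalProperties: false,
};

// Minimal hand-written check that mirrors the example schema above.
function matchesSchema(response) {
  const allowed = responseSchema.properties.ratings.items.enum;
  return (
    typeof response === "object" &&
    response !== null &&
    !Array.isArray(response) &&
    Object.keys(response).every((k) => k === "ratings") &&
    Array.isArray(response.ratings) &&
    response.ratings.every((r) => allowed.includes(r))
  );
}

console.log(matchesSchema({ ratings: ["certain", "likely"] })); // true
console.log(matchesSchema({ ratings: ["confirmed"] })); // false
```

A schema like this restricts the vocabulary only. It cannot by itself prevent an upgrade, so the KEEP-or-DOWNGRADE rule has to be enforced as a separate check on the response.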

Originals of every modified file are preserved in timestamped backup directories alongside the live archive. These constitute the correction trail required by the methodology commitment, and are retained indefinitely.

Reproducibility

The full audit can be reproduced from the public repository:

node scripts/audit-extractions.mjs: runs the rule-based detection
node scripts/auto-fix-extractions.mjs --apply: applies deterministic fixes
node scripts/retry-extraction-errors.mjs --apply: recovers content-filter failures
node scripts/retry-empty-extractions.mjs --apply: recovers silent extraction failures
node scripts/recalibrate-confidence.mjs --apply: recalibrates per-claim confidence