02 · Records

Problem, fix, result

Only the useful parts of each project

Case records

Short records, visible numbers

Evaluation systems

Context: While testing RAG answers on maintenance and safety manuals, plausible-looking answers still cited a nearby but wrong authority
Problem: Industrial manual QA needs authority checks, not generic answer similarity
Bottleneck: Aggregate scores hid an item-level safety citation regression
Fix: Built a domain fixture, holdout split, gate states, hybrid RRF baseline, citation diagnostics, and SME review packet
Result: v5_t31 hybrid: recall@5 0.978; citation hit 0.945; safety specificity 5/5 exact; review queue 5 items with 2 P0

91 fixture items: manual QA cases with expected source behavior

68 tests: regression checks run in CI

5 items in review queue: support gaps prepared for SME review
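The hybrid RRF baseline and the citation check named in the Fix line can be sketched as follows. This is a minimal illustration, not the project's code: the function names, the section ids, and the constant k=60 (a commonly used default for reciprocal rank fusion) are assumptions. It shows why recall@5 can look fine while the top citation still points at a neighbouring section.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document ids with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def recall_at_k(ranked, relevant, k=5):
    """Share of relevant ids that appear in the top k of the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def citation_hit(cited_section, expected_section):
    """Authority check: the cited section must match exactly, not merely be nearby."""
    return cited_section == expected_section


# Hypothetical section ids for one fixture item.
lexical = ["lockout-4.2", "lockout-4.1", "torque-7.3"]
dense = ["lockout-4.1", "ppe-2.0", "lockout-4.2"]
fused = rrf_fuse([lexical, dense])

print(fused)
print(recall_at_k(fused, {"lockout-4.2"}, k=5))
print(citation_hit(fused[0], "lockout-4.2"))
```

In this toy item the expected section is retrieved within the top 5, yet the top-ranked candidate is the adjacent section, which is the kind of item-level miss an aggregate score can hide.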

Agent workflow evaluation

Context: While shaping invoice-agent workflows, document extraction looked clean but payment state could still be unsafe
Problem: P2P agents need state replay for duplicate invoices, vendor mismatch, receipts, approvals, and active holds
Bottleneck: 2-way, 3-way, invoice-before-GR, and consignment flows accept different event orders before action execution
Fix: Built JSONL replay, streaming CSV/XES import, BPIC2019 mapping pack, auto policy template, case oracle, seeded defect pack, audit CLI, agent-action gate, ops-readiness report, SQLite store replay, checkpoint recovery, scorecard, and CI
Result: 12 clean traces; 48 injected scenarios; BPIC2019 real-XES smoke run on 1,000 cases; action gate blocks unsafe payment; store replay recovers 4 partitions after simulated interruption; 255 attempted appends; 204 duplicate retries; 66 tests

48 scenarios: seeded replay defects checked by expected codes

36/36 critical caught: duplicate, vendor, early-payment, and blocked-payment rows caught

4 partitions recovered: checkpoints restored after simulated interruption
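A state-replay gate of the kind described above can be sketched as follows. Everything here is an assumption made for illustration: the event types, field names, and defect codes are hypothetical, and the rules model only a simple 3-way-match flow (the record notes that 2-way, invoice-before-GR, and consignment flows accept other event orders).

```python
import json


def replay_gate(jsonl_text):
    """Replay events in order and return (invoice, decision, codes) per payment request."""
    po_vendor = {}        # purchase order -> vendor named on the order
    received = set()      # purchase orders that have a goods receipt
    invoices = {}         # invoice id -> invoice event
    seen_numbers = set()  # (vendor, vendor invoice number) pairs already seen
    duplicates, approved, held = set(), set(), set()
    decisions = []
    for line in jsonl_text.strip().splitlines():
        ev = json.loads(line)
        kind = ev["type"]
        if kind == "po_created":
            po_vendor[ev["po"]] = ev["vendor"]
        elif kind == "goods_receipt":
            received.add(ev["po"])
        elif kind == "invoice":
            key = (ev["vendor"], ev["number"])
            if key in seen_numbers:
                duplicates.add(ev["invoice"])
            seen_numbers.add(key)
            invoices[ev["invoice"]] = ev
        elif kind == "approval":
            approved.add(ev["invoice"])
        elif kind == "hold":
            held.add(ev["invoice"])
        elif kind == "release":
            held.discard(ev["invoice"])
        elif kind == "pay_request":
            inv_id = ev["invoice"]
            inv = invoices.get(inv_id)
            codes = []
            if inv is None:
                codes.append("UNKNOWN_INVOICE")
            else:
                if inv_id in duplicates:
                    codes.append("DUPLICATE_INVOICE")
                if po_vendor.get(inv["po"]) != inv["vendor"]:
                    codes.append("VENDOR_MISMATCH")
                if inv["po"] not in received:
                    codes.append("NO_GOODS_RECEIPT")
            if inv_id not in approved:
                codes.append("NOT_APPROVED")
            if inv_id in held:
                codes.append("ACTIVE_HOLD")
            decisions.append((inv_id, "BLOCK" if codes else "ALLOW", codes))
    return decisions


# Hypothetical trace: payment requested before the goods receipt, then a duplicate invoice.
events = [
    {"type": "po_created", "po": "PO-1", "vendor": "V-9"},
    {"type": "invoice", "invoice": "INV-1", "po": "PO-1", "vendor": "V-9", "number": "A100"},
    {"type": "approval", "invoice": "INV-1"},
    {"type": "pay_request", "invoice": "INV-1"},
    {"type": "goods_receipt", "po": "PO-1"},
    {"type": "pay_request", "invoice": "INV-1"},
    {"type": "invoice", "invoice": "INV-2", "po": "PO-1", "vendor": "V-9", "number": "A100"},
    {"type": "approval", "invoice": "INV-2"},
    {"type": "pay_request", "invoice": "INV-2"},
]
decisions = replay_gate("\n".join(json.dumps(e) for e in events))
for decision in decisions:
    print(decision)
```

The point of replaying state rather than inspecting one document is visible in the trace: the same invoice is blocked before the receipt and allowed after it, and the second invoice is blocked only because of what was seen earlier.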

Data systems · replenishment policy

Context: While converting retail demand forecasts into reorder levels, a cheaper model policy still created SKU-level stockout risk
Problem: Forecast error alone did not decide whether the reorder policy was usable
Bottleneck: A lower inventory policy could pass aggregate cost while failing service under lead-time and SKU checks
Fix: Built base-stock simulation, service-floor gate, frontier selection, lead-time grid, cost sensitivity grid, and named failure-mode report
Result: Top 50 SKUs; final gate review; robust q 0.99 passes 4/4 lead-time settings; SKU floor 48/50

0.865 model WAPE: weighted forecast error, lower is better

-57.46% cost delta: simulated policy cost reduction vs baseline

12/16 blocked lead-time: lead-time and quantile settings blocked by service-floor failure
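The base-stock simulation and service-floor gate can be illustrated with a small sketch. The assumptions are loud ones: a periodic-review order-up-to policy, lost sales, constant toy demand, and fill rate as the service measure. The function names and numbers are hypothetical and are not the project's figures.

```python
def simulate_base_stock(demand, base_stock, lead_time):
    """Simulate an order-up-to policy with lost sales and return the fill rate."""
    on_hand = base_stock
    pipeline = [0] * lead_time  # open orders, oldest first
    served = 0
    for d in demand:
        if lead_time > 0:
            on_hand += pipeline.pop(0)  # receive the order placed lead_time periods ago
        sold = min(on_hand, d)
        on_hand -= sold
        served += sold
        # Order back up to the base-stock level on inventory position.
        position = on_hand + sum(pipeline)
        order = max(base_stock - position, 0)
        if lead_time > 0:
            pipeline.append(order)
        else:
            on_hand += order
    total = sum(demand)
    return served / total if total else 1.0


def lead_time_grid(demand, base_stock, lead_times, floor):
    """Return {lead_time: passes_floor} for one reorder level."""
    return {
        lt: simulate_base_stock(demand, base_stock, lt) >= floor
        for lt in lead_times
    }


demand = [10] * 8  # toy demand series for one SKU
print(lead_time_grid(demand, base_stock=20, lead_times=[1, 2], floor=0.95))
print(lead_time_grid(demand, base_stock=10, lead_times=[1, 2], floor=0.95))
```

The lower reorder level holds less stock and so looks cheaper, but it fails the floor as soon as the lead time stretches to two periods, which is the pattern the gate is meant to catch.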

Agent tooling

Context: While running agent workflows with many tools, the session budget was being spent before the task began
Problem: Agent sessions were paying hidden context cost for tool catalogs
Bottleneck: MCP, OpenAPI, and custom manifests expose schema surface in different formats
Fix: Normalized schema cost, added reports, PR diffs, lazy-schema proxy, public catalog benchmark, and MCP config risk lint
Result: 10 public catalogs: 3,429 tools; risk lint sample: 2 servers, 5 findings, high risk

3,429 public tools: tools across 10 MCP/OpenAPI catalogs

1.44M full tax: estimated schema tokens loaded up front

5 risk findings: no-probe MCP config lint findings in the sample report
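Normalizing schema cost across formats might look like the sketch below. It is an assumption-heavy illustration: the four-characters-per-token ratio is a rough heuristic rather than a real tokenizer, the helper names are hypothetical, and only two input shapes are handled (an MCP-style tool with `inputSchema` and an OpenAPI-style operation with `operationId`).

```python
import json

CHARS_PER_TOKEN = 4  # rough heuristic (assumption); real tokenizers differ


def normalize_tool(entry):
    """Map an MCP-style tool or an OpenAPI-style operation to one common shape."""
    if "inputSchema" in entry:  # MCP-style tool entry
        return {
            "name": entry["name"],
            "description": entry.get("description", ""),
            "schema": entry["inputSchema"],
        }
    if "operationId" in entry:  # OpenAPI-style operation object
        return {
            "name": entry["operationId"],
            "description": entry.get("summary", ""),
            "schema": {
                "parameters": entry.get("parameters", []),
                "requestBody": entry.get("requestBody", {}),
            },
        }
    raise ValueError("unrecognized tool format")


def estimate_tokens(obj):
    """Estimate tokens from the compact JSON length, rounded up."""
    text = json.dumps(obj, separators=(",", ":"), sort_keys=True)
    return -(-len(text) // CHARS_PER_TOKEN)


def catalog_cost(entries):
    """Up-front cost when every full schema is loaded into context."""
    return sum(estimate_tokens(normalize_tool(e)) for e in entries)


def lazy_cost(entries):
    """Cost when only names and descriptions are loaded up front."""
    total = 0
    for e in entries:
        tool = normalize_tool(e)
        total += estimate_tokens({"name": tool["name"], "description": tool["description"]})
    return total


# Hypothetical catalog entries.
mcp_tool = {
    "name": "search_orders",
    "description": "Search orders by customer",
    "inputSchema": {"type": "object", "properties": {"customer": {"type": "string"}}},
}
openapi_op = {
    "operationId": "getInvoice",
    "summary": "Fetch one invoice",
    "parameters": [{"name": "id", "in": "path", "required": True, "schema": {"type": "string"}}],
}
catalog = [mcp_tool, openapi_op]
print(catalog_cost(catalog), lazy_cost(catalog))
```

Comparing the two totals gives a first-order estimate of what a lazy-schema proxy saves before the task begins.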

Agent web context

Context: While using reference websites to guide agents, useful tone and structure came with unwanted source-site subject matter
Problem: Reference-site style was useful, but source-site nouns leaked into new products
Bottleneck: Keeping rhythm and structure without copying spans or importing the original business
Fix: Split SITE and VOICE files, then added source-term boundaries and visible HTML scoring
Result: Stripe/LedgerFlow web output: 63.2 → 92.1; mimic risk 0.0

63.2 before: webfit visible HTML score without context files

92.1 after: webfit visible HTML score with SITE/VOICE context

+28.9 delta: same prompt, different context

0.0 mimic risk: copy overlap risk in the webfit report
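One way to picture a source-term boundary and a copy-overlap measure is the sketch below. It is not the webfit scoring itself: the function names, the n-gram definition of overlap, and the sample strings are all invented for illustration.

```python
import re


def tokens(text):
    """Lowercase word tokens from visible text."""
    return re.findall(r"[a-z0-9]+", text.lower())


def leaked_terms(output, blocked_terms):
    """Source-site nouns that appear in the new product's copy."""
    words = set(tokens(output))
    return sorted(t for t in blocked_terms if t.lower() in words)


def span_overlap(output, source, n=3):
    """Share of output n-grams that also occur verbatim in the source text."""
    def ngrams(toks):
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    out = ngrams(tokens(output))
    if not out:
        return 0.0
    return len(out & ngrams(tokens(source))) / len(out)


# Invented sample copy; not taken from any real site.
source = "Accept payments online and in person"
blocked = ["payments", "checkout"]
clean = "Reconcile ledgers daily and at month end with LedgerFlow"
leaky = "Accept payments online with LedgerFlow"

print(leaked_terms(clean, blocked), span_overlap(clean, source))
print(leaked_terms(leaky, blocked), span_overlap(leaky, source))
```

The first check covers subject-matter leakage and the second covers copied spans, so copy can keep a reference site's rhythm while both stay at zero.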

Applied audio ML

Context: While estimating song key from audio, one global label failed on tracks that change key by section
Problem: Single-key prediction hides section-level modulation
Bottleneck: Exposing region-level output while moving a large checkpoint out of git history
Fix: Built chroma/HPCP inference, CLI/API paths, release checkpoint loading, and SHA-256 verification
Result: Runnable local and API inference; benchmark page waits on training provenance
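Verifying a release checkpoint by SHA-256 before loading it can be sketched with the standard library alone. The function names are hypothetical, and the expected digest would come from wherever the release publishes it; neither is taken from the project.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large checkpoint is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checkpoint(path, expected_hex):
    """Return the path if its digest matches; raise before anything gets loaded."""
    actual = sha256_of(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return Path(path)
```

Running the check before deserialization means a truncated or swapped download fails loudly instead of producing a silently wrong model.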