02 · Records

Problem, fix, result

Only the useful parts of each project

Case records

Short records, visible numbers

Evaluation systems

Industrial RAG Gate

Context
In tests of RAG answers on maintenance and safety manuals, good-looking answers still cited a nearby but wrong source as the authority
Problem
Industrial manual QA needs authority checks, not generic answer similarity
Bottleneck
Aggregate scores hid an item-level safety citation regression
Fix
Built a domain fixture, holdout split, gate states, hybrid RRF baseline, citation diagnostics, and SME review packet
Result
v5_t31 hybrid: recall@5 0.978; citation hit 0.945; safety specificity 5/5 exact; review queue 5 items with 2 P0
fixture items: 91 (manual QA cases with expected source behavior)
tests: 68 (regression checks run in CI)
review queue: 5 items (support gaps prepared for SME review)
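The hybrid RRF baseline and the item-level gate can be sketched roughly as below. This is a minimal illustration, not the project's code: the function names, the item fields (`id`, `safety`, `cited`, `expected`), the gate states, and the 0.9 threshold are all assumptions, and `k=60` is only the constant commonly used with reciprocal rank fusion.

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list adds 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

def recall_at_k(ranked, relevant, k=5):
    """Fraction of relevant ids found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def gate(items, min_hit_rate=0.9):
    """Hypothetical gate: any safety item with a wrong citation blocks,
    even when the aggregate citation hit rate clears the threshold."""
    if not items:
        return "review", []
    hits = [it["cited"] in it["expected"] for it in items]
    safety_misses = [it["id"] for it, hit in zip(items, hits)
                     if it["safety"] and not hit]
    if safety_misses:
        return "block", safety_misses
    rate = sum(hits) / len(hits)
    return ("pass" if rate >= min_hit_rate else "review"), []

# Illustrative data only: two retrievers ranking made-up chunk ids.
lexical = ["manual_a#3", "manual_b#7", "manual_a#9"]
dense = ["manual_b#7", "manual_c#1", "manual_a#3"]
fused = rrf_fuse([lexical, dense])

# Nine correct citations plus one safety item citing a nearby wrong section.
items = [{"id": f"q{i}", "safety": False, "cited": "manual_a#3",
          "expected": {"manual_a#3"}} for i in range(9)]
items.append({"id": "q9", "safety": True, "cited": "manual_a#4",
              "expected": {"manual_a#3"}})
state, offenders = gate(items)
```

In this made-up example the aggregate hit rate is 0.9, which would clear the threshold, yet the gate still blocks on `q9`. That is the kind of item-level regression an aggregate score can hide.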
Agent workflow evaluation

P2P Replay Gate

Context
While shaping invoice-agent workflows, document extraction looked clean but payment state could still be unsafe
Problem
P2P agents need state replay for duplicate invoices, vendor mismatch, receipts, approvals, and active holds
Bottleneck
2-way, 3-way, invoice-before-GR, and consignment flows accept different event orders before action execution
Fix
Built JSONL replay, streaming CSV/XES import, BPIC2019 mapping pack, auto policy template, case oracle, seeded defect pack, audit CLI, agent-action gate, ops-readiness report, SQLite store replay, checkpoint recovery, scorecard, and CI
Result
12 clean traces; 48 injected scenarios; BPIC2019 real XES smoke run over 1,000 cases; action gate blocks unsafe payment; store replay recovers 4 partitions after simulated interruption; 255 attempted appends; 204 duplicate retries; 66 tests
scenarios: 48 (seeded replay defects checked by expected codes)
critical caught: 36/36 (duplicate, vendor, early-payment, and blocked-payment rows)
recovery: recovered (partition checkpoints restored after simulated interruption)
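A rough sketch of the replay-then-gate idea: fold JSONL events into per-invoice state, then refuse a payment action while any blocking condition holds. The event types, field names, and reason codes here are hypothetical, chosen for illustration rather than taken from the project.

```python
import json

def replay(lines):
    """Fold JSONL events into per-invoice state (assumed event schema)."""
    state = {}
    seen_invoices = set()
    for line in lines:
        ev = json.loads(line)
        inv = state.setdefault(ev["invoice_id"], {
            "received": False, "approved": False,
            "hold": False, "duplicate": False,
        })
        kind = ev["type"]
        if kind == "invoice_posted":
            key = (ev["vendor"], ev["invoice_no"])
            if key in seen_invoices:
                inv["duplicate"] = True  # same vendor and number seen before
            seen_invoices.add(key)
        elif kind == "goods_receipt":
            inv["received"] = True
        elif kind == "approval":
            inv["approved"] = True
        elif kind == "hold_set":
            inv["hold"] = True
        elif kind == "hold_released":
            inv["hold"] = False
    return state

def gate_payment(inv_state, three_way=True):
    """Return blocking reason codes; an empty list means payment may proceed."""
    reasons = []
    if inv_state["duplicate"]:
        reasons.append("DUPLICATE_INVOICE")
    if inv_state["hold"]:
        reasons.append("ACTIVE_HOLD")
    if not inv_state["approved"]:
        reasons.append("MISSING_APPROVAL")
    if three_way and not inv_state["received"]:
        reasons.append("MISSING_GOODS_RECEIPT")
    return reasons

trace = [
    '{"invoice_id": "inv-1", "type": "invoice_posted", "vendor": "V1", "invoice_no": "100"}',
    '{"invoice_id": "inv-1", "type": "goods_receipt"}',
    '{"invoice_id": "inv-1", "type": "approval"}',
    '{"invoice_id": "inv-2", "type": "invoice_posted", "vendor": "V1", "invoice_no": "100"}',
    '{"invoice_id": "inv-2", "type": "approval"}',
    '{"invoice_id": "inv-2", "type": "hold_set"}',
]
state = replay(trace)
```

The `three_way` flag stands in for the flow-specific ordering rules: a 2-way flow would not require a goods receipt before payment, while a 3-way flow would.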
Data systems · replenishment policy

Replenishment Policy Gate

Context
While converting retail demand forecasts into reorder levels, a cheaper model policy still created SKU-level stockout risk
Problem
Forecast error alone did not decide whether the reorder policy was usable
Bottleneck
A lower inventory policy could pass aggregate cost while failing service under lead-time and SKU checks
Fix
Built base-stock simulation, service-floor gate, frontier selection, lead-time grid, cost sensitivity grid, and named failure-mode report
Result
Top 50 SKUs; final gate review; robust quantile q = 0.99 passes 4/4 lead-time settings; SKU floor 48/50
model WAPE: 0.865 (weighted forecast error, lower is better)
cost delta: -57.46% (simulated policy cost reduction vs baseline)
blocked lead-time: 12/16 (lead-time and quantile settings blocked by service-floor failure)
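To make the gate concrete, here is a small base-stock simulation with a per-SKU service floor checked across a lead-time grid. It is a sketch under simplifying assumptions (periodic review, lost sales, integer demand, fill rate as the service measure); the function names, the 0.95 floor, and the demand numbers are illustrative and not the project's values.

```python
def simulate_base_stock(demand, base_stock, lead_time):
    """Order-up-to policy with lost sales; returns the fill rate."""
    on_hand = base_stock
    pipeline = [0] * lead_time  # pipeline[0] arrives at the start of the next period
    filled = total = 0
    for d in demand:
        if lead_time > 0:
            on_hand += pipeline.pop(0)  # receive the oldest outstanding order
        sold = min(on_hand, d)
        on_hand -= sold
        filled += sold
        total += d
        # Raise inventory position (on hand + on order) back to the base-stock level.
        order = max(0, base_stock - (on_hand + sum(pipeline)))
        if lead_time > 0:
            pipeline.append(order)
        else:
            on_hand += order
    return filled / total if total else 1.0

def service_floor_gate(sku_demand, base_stock_by_sku, lead_times, floor=0.95):
    """Return the lead-time settings blocked because some SKU falls below the floor."""
    blocked = []
    for lt in lead_times:
        worst = min(
            simulate_base_stock(sku_demand[s], base_stock_by_sku[s], lt)
            for s in sku_demand
        )
        if worst < floor:
            blocked.append(lt)
    return blocked

# Made-up demand: SKU "A" carries the leaner (cheaper) base-stock level.
sku_demand = {"A": [5, 5, 5, 5], "B": [5, 5, 5, 5]}
base_stock_by_sku = {"A": 10, "B": 15}
blocked = service_floor_gate(sku_demand, base_stock_by_sku, lead_times=[1, 2, 3])
```

With these numbers the leaner level for SKU "A" holds at lead times 1 and 2 but drops to a 0.75 fill rate at lead time 3, so only that setting is blocked. A cost comparison alone would not show this.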
Agent tooling

tool-tax

Context
While running agent workflows with many tools, the session budget was being spent before the task began
Problem
Agent sessions were paying hidden context cost for tool catalogs
Bottleneck
MCP, OpenAPI, and custom manifests expose schema surface in different formats
Fix
Normalized schema cost, added reports, PR diffs, lazy-schema proxy, public catalog benchmark, and MCP config risk lint
Result
10 public catalogs: 3,429 tools; risk lint sample: 2 servers, 5 findings, high risk
public tools: 3,429 (tools across 10 MCP/OpenAPI catalogs)
full tax: 1.44M (estimated schema tokens loaded up front)
risk findings: 5 (no-probe MCP config lint findings in the sample report)
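One way to picture the "tax" is to serialize each tool's schema and estimate tokens from its length, then compare that with a lazy stub that carries only the name and description. The sketch below assumes an MCP-style tool entry with an `inputSchema` field and a crude four-characters-per-token heuristic. Both are assumptions for illustration, not the tool's actual method or tokenizer.

```python
import json

def schema_tokens(obj, chars_per_token=4):
    """Rough token estimate from compact JSON length (heuristic, not a tokenizer)."""
    text = json.dumps(obj, separators=(",", ":"), sort_keys=True)
    return -(-len(text) // chars_per_token)  # ceiling division

def full_tax(tools):
    """Estimated tokens if every full tool schema is loaded up front."""
    return sum(schema_tokens(t) for t in tools)

def lazy_tax(tools):
    """Estimated tokens if only name and description are loaded up front."""
    return sum(
        schema_tokens({"name": t["name"], "description": t.get("description", "")})
        for t in tools
    )

# Illustrative catalog entry; the tool and its fields are made up.
tools = [{
    "name": "create_invoice",
    "description": "Create an invoice for a customer.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "customer_id": {"type": "string"},
            "amount_cents": {"type": "integer", "minimum": 0},
            "currency": {"type": "string", "enum": ["USD", "EUR"]},
        },
        "required": ["customer_id", "amount_cents"],
    },
}]
saved = full_tax(tools) - lazy_tax(tools)
```

Running the same estimate on two revisions of a catalog gives the kind of per-PR diff described above. A real report would swap the heuristic for the target model's tokenizer.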
Agent web context

site-voice-packs

Context
While using reference websites to guide agents, useful tone and structure came with unwanted source-site subject matter
Problem
Reference-site style was useful, but source-site nouns leaked into new products
Bottleneck
Keeping rhythm and structure without copying spans or importing the original business
Fix
Split SITE and VOICE files, then added source-term boundaries and visible HTML scoring
Result
Stripe/LedgerFlow web output: 63.2 → 92.1; mimic risk 0.0
before webfit: 63.2 (visible HTML score without context files)
after webfit: 92.1 (visible HTML score with SITE/VOICE context)
delta: +28.9 (same prompt, different context)
mimic risk: 0.0 (copy overlap risk in the webfit report)
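A source-term boundary check on visible HTML might look like the sketch below: extract only the text a visitor would see, then flag any banned source-site terms that appear in it. The class, the function, and the term list are hypothetical, and this does not reproduce how the webfit score or mimic risk is computed.

```python
import re
from html.parser import HTMLParser

class VisibleText(HTMLParser):
    """Collect text outside <head>, <script>, and <style>."""
    SKIP = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def leaked_terms(html, source_terms):
    """Return the source-site terms that appear as whole words in visible text."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    return sorted(
        term for term in source_terms
        if re.search(rf"\b{re.escape(term.lower())}\b", text)
    )

# Illustrative page: the banned term appears only in non-visible markup.
page = (
    "<html><head><title>Payments</title><style>.pay{}</style></head>"
    "<body><h1>LedgerFlow</h1><script>var payments = 1;</script>"
    "<p>Close the books faster.</p></body></html>"
)
leaks = leaked_terms(page, ["payments", "checkout"])
```

Restricting the check to visible text means a term that only appears in scripts, styles, or the document head does not count against the page.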
Applied audio ML

Modulation-aware Key Estimator

Context
While estimating song key from audio, one global label failed on tracks that change key by section
Problem
Single-key prediction hides section-level modulation
Bottleneck
Exposing region-level output while moving a large checkpoint out of git history
Fix
Built chroma/HPCP inference, CLI/API paths, release checkpoint loading, and SHA-256 verification
Result
Runnable local and API inference; benchmark page waits on training provenance
key classes: 12 (major/minor pitch-class targets)
interfaces: CLI/API (local and service inference paths)
training record: pending (dataset and training manifest still incomplete)
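Once a checkpoint lives outside git, verifying it before loading is a short step. The sketch below uses Python's standard `hashlib`; the function names and the idea of pinning the expected digest alongside the release are assumptions for illustration.

```python
import hashlib

def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so a large checkpoint is never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_checkpoint(path, expected_hex):
    """Raise if the file's SHA-256 differs from the pinned digest; else return the path."""
    actual = sha256_of(path)
    if actual != expected_hex.strip().lower():
        raise ValueError(f"checkpoint digest mismatch: expected {expected_hex}, got {actual}")
    return path
```

A loader would call `verify_checkpoint` on the downloaded release file and only then hand the path to the model-loading code, so a corrupted or swapped file fails before inference starts.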
Index

Repos and reports

Work
Result
Signal
Industrial RAG Gate (Evaluation systems)
v5_t31 hybrid: recall@5 0.978; citation hit 0.945; safety specificity 5/5 exact; review queue 5 items with 2 P0
fixture items: 91 · tests: 68 · review queue: 5 items
P2P Replay Gate (Agent workflow evaluation)
12 clean traces; 48 injected scenarios; BPIC2019 real XES smoke run over 1,000 cases; action gate blocks unsafe payment; store replay recovers 4 partitions after simulated interruption; 255 attempted appends; 204 duplicate retries; 66 tests
scenarios: 48 · critical caught: 36/36 · recovery: recovered
Replenishment Policy Gate (Data systems · replenishment policy)
Top 50 SKUs; final gate review; robust quantile q = 0.99 passes 4/4 lead-time settings; SKU floor 48/50
model WAPE: 0.865 · cost delta: -57.46% · blocked lead-time: 12/16
tool-tax (Agent tooling)
10 public catalogs: 3,429 tools; risk lint sample: 2 servers, 5 findings, high risk
public tools: 3,429 · full tax: 1.44M · risk findings: 5
site-voice-packs (Agent web context)
Stripe/LedgerFlow web output: 63.2 → 92.1; mimic risk 0.0
before webfit: 63.2 · after webfit: 92.1 · delta: +28.9 · mimic risk: 0.0
Modulation-aware Key Estimator (Applied audio ML)
Runnable local and API inference; benchmark page waits on training provenance
key classes: 12 · interfaces: CLI/API · training record: pending
P2P Replay Gate (Agent workflow evaluation · procurement controls)
Pre-checks purchase-to-pay agent actions against replayed workflow state before payment
larger BPIC2019 run · trace visualizer · external AP workflow feedback
Production-grade RAG Evaluation System (AI/Data Engineering)
Gate prompt, model, index, and chunking changes on retrieval quality, latency, and cost
benchmark · demo · runbook
AI-native Research Workspace (Full-stack AI)
Compare retrieval changes, prompt diffs, eval output, and model responses in one local review surface
prototype · screen capture · usage notes