Incidents & Retros
Production maturity is easier to claim than to demonstrate. These are real incidents from the system, written tight: what failed, how it was caught, what caused it, what fixed it, and what changed structurally so the same class of bug doesn't recur.
The public set is curated as case studies, not a chronological feed. Three retros across three different failure domains; the private interview kit holds deeper retros with full code paths and naive first attempts.
-
PFE short-sell — accidental short on a long-only system
2026-04-22 · P1 · Trade execution
The executor opened a short position on a stock the system was only supposed to be exiting. Defense-in-depth fix shipped same day.
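One layer of a defense-in-depth fix for this class of bug can be sketched as a pre-trade guard that rejects any sell larger than the current holding. This is a minimal illustration, not the system's actual fix: `Order`, `LongOnlyViolation`, and `check_long_only` are hypothetical names, and the sketch assumes the held quantity is known reliably when the order is built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """Hypothetical order shape, for illustration only."""
    symbol: str
    side: str       # "buy" or "sell"
    quantity: int


class LongOnlyViolation(Exception):
    """Raised when an order would leave a negative (short) position."""


def check_long_only(order: Order, held_quantity: int) -> Order:
    """Return the order unchanged if it cannot open a short; raise otherwise."""
    if order.quantity <= 0:
        raise ValueError("order quantity must be positive")
    # A sell larger than the current holding would flip the position short.
    if order.side == "sell" and order.quantity > held_quantity:
        raise LongOnlyViolation(
            f"sell of {order.quantity} {order.symbol} exceeds holding of {held_quantity}"
        )
    return order
```

Running the guard at order-construction time and again just before submission is one way to get overlapping checks, so that a stale position snapshot in one layer is caught by the other.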
-
EOD pipeline recovery — 50 minutes to 4.5 minutes
2026-05-01 · P2 · Infrastructure
Three independent issues stacked into a roughly 10× runtime regression. Fixing all three the same day cut the run from 50 minutes to 4.5 minutes, and the defensive pattern that came out of it was rolled out to other parts of the data layer.
-
Predictor meta-model collapse — 27 UP / 0 DOWN
2026-04-28 · P1 · ML model
A weekly retrain produced a degenerate output distribution (27 UP, 0 DOWN). Root cause was placeholder constants in production training data; the fix moved validation IC (information coefficient) from 0.053 to 0.132.
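A collapse like 27 UP / 0 DOWN is cheap to detect before a retrained model is promoted. The sketch below flags a batch of predictions whose minority class falls under a floor. It is an illustration under assumptions: the function names and the 10% floor are hypothetical, and only the `"UP"` / `"DOWN"` labels come from the incident title.

```python
from collections import Counter
from typing import Sequence


def minority_share(labels: Sequence[str]) -> float:
    """Fraction of predictions that belong to the rarer of UP and DOWN."""
    if not labels:
        raise ValueError("no predictions to check")
    counts = Counter(labels)
    # A class that never appears counts as zero.
    return min(counts.get("UP", 0), counts.get("DOWN", 0)) / len(labels)


def is_degenerate(labels: Sequence[str], min_share: float = 0.10) -> bool:
    """True when the rarer class falls below min_share (hypothetical floor)."""
    return minority_share(labels) < min_share
```

A gate of this kind catches the symptom only. The root cause here sat in the training data, so an output check complements input validation rather than replacing it.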