Technical Report · November 2025 · 52 pages

OGEE: Outcome-Guided Expert Estimation for Compliance Intelligence

Abstract

Compliance machine learning models are conventionally trained on human-labeled datasets where experienced analysts flag transactions as suspicious or clear. This label-based approach encounters a fundamental ceiling: inter-annotator agreement among compliance analysts rarely exceeds 89%, meaning the model is trained to replicate human disagreement rather than optimize for regulatory outcomes. This paper introduces OGEE (Outcome-Guided Expert Estimation), a framework that trains compliance models on downstream outcomes -- whether transactions were ultimately accepted, rejected, returned for information (RFI), or flagged by correspondent banks and regulators -- rather than on analyst labels. We formalize the problem as a contextual bandit with per-bank reward profiles, derive the training objective using maximum entropy inverse reinforcement learning (MaxEnt IRL), and demonstrate that OGEE models achieve a 23% relative improvement in outcome prediction accuracy compared to label-trained baselines. We specify the champion/challenger deployment methodology with 7-day minimum validation windows required for production deployment in regulated environments.

Introduction

The application of machine learning to financial compliance has followed a familiar pattern: collect labeled training data from human experts, train a supervised classification model, and deploy it to flag transactions for review. This approach has produced meaningful improvements in alert prioritization and risk scoring, but it suffers from a fundamental limitation that we term the labeller ceiling.

The labeller ceiling arises because compliance decisions are inherently subjective. When two experienced compliance analysts review the same transaction, they disagree on the appropriate disposition approximately 11% of the time. This disagreement is not noise -- it reflects genuine ambiguity in compliance assessment, differences in institutional risk appetite, and variation in individual expertise. A model trained to replicate these labels cannot exceed the consistency of the labels themselves, establishing an effective performance ceiling of approximately 89%.

More importantly, label-based training optimizes for the wrong objective. The relevant question is not "would a compliance analyst flag this transaction?" but rather "will this transaction be accepted by the correspondent bank, returned for additional information, or rejected?" These downstream outcomes are the ground truth that determines whether a compliance decision was correct, and they incorporate information that is unavailable to the labelling analyst -- the correspondent bank's risk appetite, current regulatory enforcement priorities, and the quality of the compliance documentation package.

The 89% Labeller Ceiling

To quantify the labeller ceiling, we conducted a controlled experiment with 42 compliance analysts across six institutions. Each analyst reviewed 500 identical transactions and assigned a disposition: clear, escalate, or block. The transactions spanned a representative distribution of risk levels, corridors, and transaction types.

Inter-annotator agreement, measured by Fleiss' kappa, was 0.72 -- indicating substantial but imperfect agreement. When aggregated by majority vote, 89.2% of transactions received a consistent disposition across all annotators. The remaining 10.8% showed meaningful disagreement, with the most common pattern being a split between "clear" and "escalate" dispositions.
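The agreement statistic above can be reproduced with a short computation. The sketch below implements the standard Fleiss' kappa formula; the rating matrix in the example is a toy input, not the study data:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a ratings count matrix.

    ratings[i][j] = number of raters who assigned item i to category j;
    every item must be rated by the same number of raters.
    """
    n_raters = sum(ratings[0])
    n_items = len(ratings)
    total = n_items * n_raters

    # Per-item observed agreement P_i, then the mean P-bar.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_i) / n_items

    # Chance agreement P_e from overall category proportions p_j.
    p_j = [sum(row[j] for row in ratings) / total for j in range(len(ratings[0]))]
    p_e = sum(p * p for p in p_j)

    return (p_bar - p_e) / (1 - p_e)

# Toy example: 3 raters, perfect agreement across mixed categories -> kappa = 1.0
print(fleiss_kappa([[3, 0], [0, 3], [3, 0]]))  # 1.0
```

A value of 0.72, as measured in the study, sits in the range conventionally read as substantial but imperfect agreement.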

Analysis of the disagreement cases reveals that they cluster around three transaction profiles: moderate-value transactions in medium-risk corridors (where the escalation threshold is ambiguous), transactions involving complex corporate structures (where beneficial ownership determination requires judgment), and transactions with partial documentary evidence (where the sufficiency of source-of-funds documentation is debatable). These are precisely the cases where model assistance would be most valuable -- and precisely the cases where label-based training is least reliable.

When we compared analyst labels to actual downstream outcomes (whether correspondent banks accepted the transactions without issue, requested additional information, or rejected them), we found that analyst labels predicted outcomes with 71% accuracy. The gap between the 89% inter-annotator agreement and the 71% outcome prediction accuracy indicates that even when analysts agree, they are not reliably predicting the metric that matters.

Contextual Bandit Formulation

We formalize compliance decision-making as a contextual bandit problem. The context vector x includes transaction features (amount, currency, corridor, frequency), entity features (originator and beneficiary risk profiles, graph-derived relationship features), and documentary features (completeness and quality scores for compliance documentation). The action space A consists of the compliance decisions available: approve with standard documentation, approve with enhanced documentation, request additional information from the originator, or block the transaction.

The reward signal r is derived from downstream outcomes observed after the transaction is submitted to the correspondent banking network. We define a structured reward function that assigns +1.0 for accepted transactions (straight-through processing), +0.3 for transactions accepted after a single information request, -0.5 for transactions returned for information by the correspondent bank, and -1.0 for rejected transactions. These reward values were calibrated through interviews with compliance officers at eight institutions and reflect the relative operational cost of each outcome.
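The context-action-reward structure of the two paragraphs above can be sketched as a minimal data model. The reward values come from the calibrated schedule in the text; the feature names and outcome keys are illustrative assumptions:

```python
from dataclasses import dataclass

# Calibrated outcome schedule from the text.
OUTCOME_REWARDS = {
    "accepted_stp": 1.0,        # straight-through processing
    "accepted_after_rfi": 0.3,  # accepted after one information request
    "rfi": -0.5,                # returned for information by the correspondent
    "rejected": -1.0,
}

# The four compliance decisions that make up the action space A.
ACTIONS = (
    "approve_standard_docs",
    "approve_enhanced_docs",
    "request_info_from_originator",
    "block",
)

@dataclass
class BanditObservation:
    context: dict       # transaction, entity, and documentary features (x)
    action: str         # one of ACTIONS
    correspondent: str  # bank whose disposition produced the outcome
    outcome: str        # key into OUTCOME_REWARDS

    @property
    def reward(self) -> float:
        return OUTCOME_REWARDS[self.outcome]

obs = BanditObservation(
    context={"amount": 48_000.0, "corridor_risk": 0.6, "doc_quality": 0.8},
    action="approve_enhanced_docs",
    correspondent="bank_a",
    outcome="accepted_after_rfi",
)
print(obs.reward)  # 0.3
```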

Critically, the reward function varies by correspondent bank. A transaction that receives straight-through processing at one correspondent may trigger an RFI at another, reflecting differences in risk appetite, screening methodology, and regulatory jurisdiction. OGEE maintains per-bank reward profiles that are updated continuously as outcome data is observed, enabling the model to learn the implicit compliance standards of each correspondent in the network.
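One minimal way to realize continuously updated per-bank profiles is an online exponential moving average over observed rewards, keyed by action. This is a sketch, not the report's implementation; the smoothing constant `alpha` is an assumption:

```python
class BankRewardProfile:
    """Running estimate of one correspondent's expected reward per action.

    alpha controls how quickly the profile tracks recent outcomes; a
    smaller alpha gives a longer effective memory.
    """

    def __init__(self, alpha: float = 0.05):
        self.alpha = alpha
        self.expected = {}  # action -> EMA of observed reward
        self.counts = {}    # action -> number of observed outcomes

    def update(self, action: str, reward: float) -> None:
        prev = self.expected.get(action, 0.0)
        self.expected[action] = prev + self.alpha * (reward - prev)
        self.counts[action] = self.counts.get(action, 0) + 1

profile = BankRewardProfile()
for _ in range(400):  # convergence horizon cited in Key Findings
    profile.update("approve_standard_docs", 1.0)
print(round(profile.expected["approve_standard_docs"], 3))  # 1.0
```

With this choice of `alpha`, the estimate is effectively converged after the roughly 400 outcomes per correspondent reported below.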

Key Findings

OGEE models trained on 18 months of outcome data (2.1 million transactions with observed correspondent bank dispositions) achieved 87.4% outcome prediction accuracy compared to 71.0% for label-trained baselines -- a 23% relative improvement. Per-bank reward profiles converge after approximately 400 observed outcomes per correspondent, enabling accurate prediction for the 200+ most active correspondent relationships. The champion/challenger deployment framework detected model degradation (5%+ accuracy decline) within an average of 3.2 days, well within the 7-day validation window. MaxEnt IRL training required 14 hours on a single A100 GPU for the full transaction corpus, with incremental updates (daily batch) completing in under 20 minutes.

Maximum Entropy Inverse Reinforcement Learning

The core technical challenge in OGEE is inferring the implicit compliance policy of each correspondent bank from observed accept/reject decisions. We approach this as an inverse reinforcement learning problem: given observed outcomes (the expert demonstrations), recover the reward function that rationalizes those outcomes.

We employ the Maximum Entropy IRL framework, which selects the reward function that makes the observed outcomes most likely while maintaining maximum entropy over unobserved behaviors. This avoids overfitting to the specific transactions observed and produces a reward function that generalizes to novel transaction profiles. The objective function maximizes the log-likelihood of observed outcomes under the learned reward function, subject to an entropy regularization term that penalizes overconfident predictions.

The per-bank reward function is parameterized as a linear combination of transaction features with bank-specific weights. This structure encodes the intuition that different correspondent banks weight compliance factors differently -- one bank may be particularly sensitive to jurisdictional risk while another prioritizes transaction amount thresholds. The learned weights provide an interpretable representation of each correspondent bank's implicit compliance policy, which has independent value for relationship management and compliance planning.
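Under the linear parameterization, a one-step MaxEnt formulation reduces to a softmax choice model over dispositions: the probability that a correspondent's implicit policy yields disposition a for context x is proportional to exp(w_b · φ(x, a)), and the log-likelihood gradient is the observed-minus-expected feature difference. The sketch below shows that gradient and a plain ascent step; the feature matrix and data are illustrative assumptions, and this simplifies away the sequential aspects of full MaxEnt IRL:

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def loglik_grad(w, phi, a_obs):
    """Gradient of log P(a_obs | x; w) for P(a | x) proportional to exp(w . phi[a]).

    phi: (num_actions, d) feature matrix for one context;
    a_obs: index of the observed disposition.
    Returns phi[a_obs] minus the model's expected features.
    """
    p = softmax(phi @ w)
    return phi[a_obs] - p @ phi

rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 6))  # 4 dispositions, 6 transaction features
w = np.zeros(6)                # bank-specific weight vector w_b
a_obs = 2                      # observed disposition for this transaction

before = softmax(phi @ w)[a_obs]
for _ in range(200):           # gradient ascent on the log-likelihood
    w += 0.1 * loglik_grad(w, phi, a_obs)
after = softmax(phi @ w)[a_obs]
print(before, "->", after)     # likelihood of the observed disposition rises
```

The fitted `w` plays the role of the interpretable per-bank weights described above: each component indicates how strongly that correspondent weights the corresponding compliance factor.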

Model Governance and Deployment

Deploying ML models in regulated compliance environments requires rigorous governance procedures. OGEE specifies a champion/challenger deployment framework in which a new model version (the challenger) runs in shadow mode alongside the production model (the champion) for a minimum of 7 calendar days before it can be promoted.

During the validation window, both models score every transaction, but only the champion's scores are used for operational decisions. The challenger's predictions are compared against observed outcomes as they arrive, and a suite of validation metrics is computed daily: outcome prediction accuracy (overall and per-corridor), false negative rate (transactions the challenger would have approved that were subsequently rejected), calibration curves (alignment between predicted probabilities and observed outcome frequencies), and feature importance stability (whether the SHAP-derived feature rankings are consistent with the champion's explanations).
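A minimal sketch of the daily metric computation over the shadow-mode comparison records. The record layout is an assumption, and calibration here uses simple equal-width probability bins rather than any particular production binning:

```python
def daily_validation_metrics(records):
    """records: list of (predicted_prob_accept, predicted_label, outcome),
    where predicted_label and outcome are 'accept' or 'reject'."""
    n = len(records)
    correct = sum(1 for _, pred, out in records if pred == out)

    # False negatives: challenger would approve, correspondent rejected.
    approved = [r for r in records if r[1] == "accept"]
    fn = sum(1 for _, _, out in approved if out == "reject")

    # Calibration: mean predicted probability vs. observed acceptance
    # frequency within 10 equal-width probability bins.
    bins = {}
    for p, _, out in records:
        b = min(int(p * 10), 9)
        probs, hits = bins.setdefault(b, ([], []))
        probs.append(p)
        hits.append(1 if out == "accept" else 0)
    calibration = {
        b: (sum(ps) / len(ps), sum(hs) / len(hs))
        for b, (ps, hs) in bins.items()
    }

    return {
        "accuracy": correct / n,
        "false_negative_rate": fn / len(approved) if approved else 0.0,
        "calibration": calibration,
    }

records = [
    (0.9, "accept", "accept"),
    (0.8, "accept", "reject"),
    (0.2, "reject", "reject"),
    (0.1, "reject", "reject"),
]
m = daily_validation_metrics(records)
print(m["accuracy"], m["false_negative_rate"])  # 0.75 0.5
```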

Promotion criteria require that the challenger equal or exceed the champion on all primary metrics and show no statistically significant degradation on any secondary metric. If the challenger fails validation, it is automatically retired and the training pipeline is notified for investigation. This framework ensures that model updates improve outcomes without introducing unexpected behavioral changes that could compromise compliance integrity.
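The promotion logic can be expressed as a simple gate over the windowed metrics. The metric names are illustrative, and the fixed tolerance standing in for a formal significance test on secondary metrics is an assumption:

```python
def can_promote(champion, challenger,
                primary=("accuracy", "false_negative_rate"),
                secondary=("per_corridor_accuracy", "calibration_error"),
                tolerance=0.01,
                higher_is_better=("accuracy", "per_corridor_accuracy")):
    """champion/challenger: dicts of metric name -> value over the 7-day window."""
    def better_or_equal(name, margin=0.0):
        a, b = challenger[name], champion[name]
        if name in higher_is_better:
            return a >= b - margin
        return a <= b + margin  # lower is better (error/rate metrics)

    # Primary metrics: challenger must equal or exceed the champion outright.
    if not all(better_or_equal(m) for m in primary):
        return False
    # Secondary metrics: no degradation beyond the tolerance.
    return all(better_or_equal(m, margin=tolerance) for m in secondary)

champ = {"accuracy": 0.87, "false_negative_rate": 0.02,
         "per_corridor_accuracy": 0.85, "calibration_error": 0.03}
chall = {"accuracy": 0.88, "false_negative_rate": 0.02,
         "per_corridor_accuracy": 0.85, "calibration_error": 0.03}
print(can_promote(champ, chall))  # True
```

A challenger that slips on any primary metric fails the gate and, per the framework above, is retired automatically for investigation.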

Conclusion

OGEE represents a fundamental shift in how compliance ML models are trained and evaluated. By anchoring the training signal to downstream regulatory outcomes rather than human labels, OGEE sidesteps the labeller ceiling that limits conventional approaches and produces models that optimize for the metric that actually determines compliance success. The contextual bandit formulation with per-bank reward profiles captures the heterogeneity of correspondent bank compliance standards, enabling more accurate prediction and more efficient compliance documentation assembly. The champion/challenger governance framework provides the deployment safety required for regulated environments while enabling continuous model improvement as outcome data accumulates.