Graph-Based Correspondent Chain Prediction at Scale | AICIL Research

Introduction

When a bank initiates a cross-border wire transfer, the payment must traverse a chain of correspondent banks to reach the beneficiary institution. The selection of this chain is a consequential decision that affects settlement time, transaction cost, compliance complexity, and counterparty risk. Yet in most institutions, chain selection is performed by experienced payment operations staff using memorized relationships and static routing tables -- a process that is difficult to scale, impossible to optimize systematically, and vulnerable to single points of failure when key staff depart.

The correspondent banking network is a natural graph: banks are nodes, bilateral correspondent relationships are edges, and transactions flow along paths through this graph. Each edge carries properties -- fee schedules, processing speed, currency capabilities, compliance screening standards, and current capacity -- that determine its suitability for a given transaction. Chain prediction is thus a constrained shortest-path problem with dynamic, multi-dimensional edge weights.
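
As a rough illustration of that framing, the sketch below runs Dijkstra's algorithm over only those edges that satisfy the hard constraints (currency support and remaining capacity). The `Edge` fields, the `cheapest_chain` name, and the fixed weights that collapse fee and processing time into one scalar are all assumptions made for this example, not the production formulation.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One correspondent relationship; field names are illustrative."""
    target: str
    fee: float             # cost in a common unit
    hours: float           # expected processing time
    currencies: frozenset  # currencies this relationship can carry
    capacity: float        # remaining amount the edge can carry


def cheapest_chain(graph, source, dest, currency, amount, fee_w=1.0, time_w=0.5):
    """Return (cost, path) for the cheapest feasible chain, or None.

    Multi-dimensional weights are collapsed into one scalar using fixed
    coefficients (fee_w, time_w), which is a simplifying assumption.
    """
    frontier = [(0.0, source, (source,))]
    settled = set()
    while frontier:
        cost, bank, path = heapq.heappop(frontier)
        if bank == dest:
            return cost, list(path)
        if bank in settled:
            continue
        settled.add(bank)
        for edge in graph.get(bank, ()):
            # Hard constraints: skip edges that cannot carry this payment.
            if currency not in edge.currencies or edge.capacity < amount:
                continue
            if edge.target in settled:
                continue
            step = fee_w * edge.fee + time_w * edge.hours
            heapq.heappush(frontier, (cost + step, edge.target, path + (edge.target,)))
    return None
```

Because the capacity check depends on the transaction amount, the same pair of banks can resolve to different chains for small and large payments.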

Despite the graph-native character of this problem, most banking systems represent correspondent relationships in relational databases and resolve chains through recursive SQL queries or, more commonly, through manual lookup. This paper demonstrates that modeling the correspondent network as a property graph in Neo4j, combined with a learned ranking model for chain scoring, yields 91% top-1 prediction accuracy while resolving chains 12x to 47x faster than recursive SQL in our benchmarks.

Graph Model Architecture

The correspondent banking graph is modeled as a labeled property graph with three node types and five relationship types. Bank nodes represent financial institutions, with properties including SWIFT BIC, country of incorporation, regulatory jurisdiction, risk rating, and processing capabilities. Account nodes represent the nostro/vostro accounts that operationalize correspondent relationships, with properties including currency, balance thresholds, and cut-off times. Currency nodes represent the currencies supported across the network, enabling efficient multi-currency path queries.

Relationship types include CORRESPONDENT_OF (bilateral correspondent banking agreement), HOLDS_ACCOUNT (bank to nostro/vostro account), SUPPORTS_CURRENCY (bank to currency), ROUTES_THROUGH (observed transaction routing), and SCREENS_WITH (compliance screening capability). Each relationship carries temporal properties (effective date, last transaction date, status) and performance properties (average processing time, fee schedule, SLA compliance rate).
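
The schema can be pictured with a small in-memory stand-in. This sketch is not the Neo4j model itself: the `PropertyGraph` class is hypothetical, and the endpoint labels for ROUTES_THROUGH and SCREENS_WITH are assumptions, since the text does not say which node types those relationships connect.

```python
from dataclasses import dataclass, field

NODE_LABELS = {"Bank", "Account", "Currency"}

# (source label, target label) per relationship type. The first three follow
# the description above; the endpoints of the last two are assumed.
REL_ENDPOINTS = {
    "CORRESPONDENT_OF": ("Bank", "Bank"),
    "HOLDS_ACCOUNT": ("Bank", "Account"),
    "SUPPORTS_CURRENCY": ("Bank", "Currency"),
    "ROUTES_THROUGH": ("Bank", "Bank"),  # assumed
    "SCREENS_WITH": ("Bank", "Bank"),    # assumed
}


@dataclass
class PropertyGraph:
    """Toy labeled property graph that validates labels and endpoints."""
    nodes: dict = field(default_factory=dict)  # node id -> (label, properties)
    edges: list = field(default_factory=list)  # (src, rel type, dst, properties)

    def add_node(self, node_id, label, **props):
        if label not in NODE_LABELS:
            raise ValueError(f"unknown node label: {label}")
        self.nodes[node_id] = (label, props)

    def add_edge(self, src, rel_type, dst, **props):
        expected = REL_ENDPOINTS[rel_type]
        actual = (self.nodes[src][0], self.nodes[dst][0])
        if actual != expected:
            raise ValueError(f"{rel_type} expects {expected}, got {actual}")
        # Temporal and performance properties travel with the relationship.
        self.edges.append((src, rel_type, dst, props))
```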

The current production graph contains 52,847 bank nodes, 214,392 correspondent relationships, and 1.2 million observed routing edges. The graph is updated daily from SWIFT directory data, bilateral agreement registries, and observed transaction flows. Historical routing data provides the training signal for the ML ranking model -- we observe which chains banks actually select for given transaction profiles and learn to predict those selections.

Native Graph Queries vs. Recursive SQL

To establish the performance advantage of graph-native querying, we benchmarked five representative chain resolution tasks against equivalent recursive SQL implementations on PostgreSQL 16 with optimized indexing. The tasks ranged from simple two-hop chain lookup to complex multi-currency chain resolution with capacity constraints; results for three of them are reported below.

For a simple two-hop chain query (find all paths from Bank A to Bank E through one intermediary), Neo4j Cypher completed in 3.2ms average versus 38.4ms for recursive SQL -- a 12x improvement. For three-hop chains with currency constraints (find paths supporting USD-to-NGN conversion with intermediate EUR steps), Neo4j completed in 8.7ms versus 247ms for recursive SQL -- a 28x improvement. The most dramatic difference appeared in complex chain resolution with dynamic constraints (find the top-5 chains from Bank A to Bank E, filtering by current processing capacity, fee thresholds, and compliance screening capabilities): Neo4j completed in 23ms versus 1,082ms for recursive SQL -- a 47x improvement.
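
For orientation, the simplest of these tasks might look like the following. The Cypher text is a guess at the query shape: the `bic` property name, the parameter names, and the relationship direction are assumptions, and the benchmarked queries are not reproduced here. The Python function expresses the same pattern match over a plain adjacency mapping.

```python
# Hypothetical Cypher for the two-hop task; schema details are assumed.
TWO_HOP_CYPHER = """
MATCH (a:Bank {bic: $source})-[:CORRESPONDENT_OF]->(mid:Bank)
      -[:CORRESPONDENT_OF]->(e:Bank {bic: $dest})
RETURN mid.bic AS intermediary
"""


def two_hop_chains(adjacency, source, dest):
    """Return every chain source -> intermediary -> dest.

    adjacency maps a bank to the set of banks it has an outgoing
    correspondent edge to. Only the source's neighbors are inspected.
    """
    return [
        [source, mid, dest]
        for mid in sorted(adjacency.get(source, ()))
        if mid not in (source, dest) and dest in adjacency.get(mid, ())
    ]
```

The function only ever touches the neighbors of the source bank, so its cost depends on that bank's degree rather than on the total number of banks.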

The performance gap widens with graph size because recursive SQL query plans degrade with join depth while Neo4j traversal cost is proportional to the local neighborhood size, not the total graph size. At our current scale of 50K+ nodes, recursive SQL approaches are viable but slow. At the target scale of 1M+ nodes, recursive SQL becomes impractical for real-time chain resolution while Neo4j performance remains sub-second with appropriate indexing.

Key Findings

The ML ranking model achieves 91% accuracy in predicting the actual correspondent chain selected by originating banks for a given transaction profile (top-1 prediction). Top-3 accuracy reaches 97%. Confidence scores (0-100) correlate strongly with prediction reliability: predictions with confidence above 80 have 96% accuracy, while predictions below 40 have 72% accuracy. Per-bank learned preferences capture institutional routing biases that are invisible to rule-based systems -- for example, certain banks systematically prefer specific intermediaries for USD clearing despite equivalent alternatives being available, a pattern the model learns from historical routing data. Graph partitioning by currency zone enables the system to scale to 1M+ nodes with p99 query latency under 200ms.
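
The accuracy figures above are top-k hit rates, optionally broken down by confidence band. A minimal sketch of how such metrics can be computed follows; the function names and the band boundaries passed in are illustrative, not taken from the evaluation code.

```python
def top_k_accuracy(ranked_lists, selected, k):
    """Fraction of transactions whose selected chain is in the top k.

    ranked_lists[i] is the model's ranking (best first) for transaction i,
    and selected[i] is the chain the originating bank actually used.
    """
    hits = sum(
        1 for ranked, actual in zip(ranked_lists, selected) if actual in ranked[:k]
    )
    return hits / len(selected)


def accuracy_by_band(confidences, correct, bands):
    """Top-1 accuracy within each half-open confidence band [low, high)."""
    result = {}
    for low, high in bands:
        flags = [ok for conf, ok in zip(confidences, correct) if low <= conf < high]
        # None marks a band that received no predictions.
        result[(low, high)] = sum(flags) / len(flags) if flags else None
    return result
```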

ML Ranking Model

The chain prediction ranking model takes as input a transaction profile (originator, beneficiary, amount, currency pair, urgency) and a candidate set of correspondent chains, and outputs a score for each chain representing the predicted probability that the originating bank would select it.

The model is a LambdaMART gradient-boosted ranking model with 86 features organized into four groups: path features (hop count, total estimated fees, estimated settlement time, minimum capacity along the path), node features (risk ratings, processing speed statistics, and SLA compliance rates for each bank in the chain), relationship features (age of correspondent relationship, recent transaction volume, fee tier), and historical features (how frequently this chain has been used for similar transactions, recency of last use, outcome history).
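
To make the first feature group concrete, the path features can be derived by aggregating per-edge properties along a candidate chain. The dictionary keys below are hypothetical, and summing per-hop processing times to estimate settlement time is an assumption of this sketch.

```python
def path_features(chain_edges):
    """Aggregate per-edge properties into chain-level path features.

    chain_edges is the ordered list of edges in one candidate chain, each a
    dict with "fee", "hours", and "capacity" keys (illustrative names).
    """
    if not chain_edges:
        raise ValueError("a chain needs at least one edge")
    return {
        "hop_count": len(chain_edges),
        "total_fee": sum(edge["fee"] for edge in chain_edges),
        # Assumes hops are processed sequentially, so times add up.
        "settlement_hours": sum(edge["hours"] for edge in chain_edges),
        # The tightest edge bounds what the whole chain can carry.
        "min_capacity": min(edge["capacity"] for edge in chain_edges),
    }
```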

Training data consists of 4.2 million observed chain selections over a 30-month period. For each observed transaction, we generate the candidate set (all viable chains up to 5 hops) and label the actually selected chain as the positive example. The model learns to rank the selected chain above alternatives, implicitly capturing the preferences and constraints that drove the selection.
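
Candidate generation and labeling can be sketched as a hop-bounded depth-first enumeration of simple paths. Viability filters such as currency or capacity checks are omitted, and both function names are illustrative.

```python
def candidate_chains(adjacency, source, dest, max_hops=5):
    """Enumerate all simple paths from source to dest with at most max_hops edges."""
    chains = []

    def extend(path):
        last = path[-1]
        if last == dest:
            chains.append(list(path))
            return
        if len(path) - 1 == max_hops:
            return  # hop budget exhausted without reaching dest
        for nxt in sorted(adjacency.get(last, ())):
            if nxt not in path:  # keep paths simple: no repeated banks
                path.append(nxt)
                extend(path)
                path.pop()

    extend([source])
    return chains


def label_candidates(chains, selected):
    """Mark the chain actually used as the positive example (1), others as 0."""
    return [(chain, 1 if chain == selected else 0) for chain in chains]
```

Each observed transaction then contributes one query group to the ranking model: its candidate chains, with exactly one positive label when the selected chain is among them.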

The confidence score is derived from the model's output margin: the score difference between the top-ranked chain and the second-ranked chain, normalized to a 0-100 scale. High-confidence predictions indicate that the model strongly prefers one chain over alternatives, typically because the transaction profile closely matches historical patterns. Low-confidence predictions indicate multiple viable chains with similar scores, suggesting the transaction profile is unusual or that the originating bank's preferences for this corridor are not well established.
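
The text does not specify how the margin is normalized, so the following sketch assumes a clipped linear scaling against a calibration constant. Both `max_margin` and the choice to report full confidence when only one candidate exists are assumptions for illustration.

```python
def confidence_score(scores, max_margin=2.0):
    """Map the top-1 vs. top-2 score margin onto a 0-100 scale.

    max_margin is a hypothetical calibration constant: margins at or above
    it are clipped to a confidence of 100.
    """
    if not scores:
        raise ValueError("need at least one candidate score")
    if len(scores) == 1:
        return 100.0  # assumed: a lone candidate has no competitor
    top, runner_up = sorted(scores, reverse=True)[:2]
    margin = top - runner_up
    return 100.0 * min(margin / max_margin, 1.0)
```

Under this scaling, two chains with identical scores yield a confidence of 0, which matches the reading of low confidence as several equally plausible chains.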

Scaling to 1M+ Nodes

The current graph of 52,847 bank nodes represents the active correspondent banking network, but several use cases require modeling a larger universe that includes non-SWIFT payment networks, historical relationships that may be reactivated, and prospective relationships for corridor expansion analysis. Our target architecture supports 1M+ nodes with sub-second query latency.

The scaling strategy employs three techniques. First, currency-zone graph partitioning divides the graph into subgraphs aligned with major clearing currencies (USD, EUR, GBP, JPY, CNY), with cross-partition edges representing currency conversion points. Most chain queries operate within a single currency zone or cross at most one partition boundary, enabling parallel subgraph queries. Second, materialized path caching pre-computes and indexes the top-K chains for the 10,000 most common origin-destination pairs, serving these from cache with sub-millisecond latency. Third, adaptive index management monitors query patterns and automatically creates composite indexes for frequently queried property combinations.
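
The second technique, materialized path caching, can be sketched as a table of precomputed top-K chains for the most frequently queried pairs. The `ChainCache` class and the `resolver` callable (standing in for an uncached graph query) are hypothetical names; cache invalidation after the daily graph update is not shown.

```python
from collections import Counter


class ChainCache:
    """Precomputed top-K chains for the most common origin-destination pairs."""

    def __init__(self, resolver, top_k=5):
        self._resolver = resolver  # callable (origin, dest) -> ranked chains
        self._top_k = top_k
        self._store = {}

    def warm(self, query_log, hot_pairs=10_000):
        """Materialize chains for the hot_pairs most frequent pairs in the log."""
        for pair, _count in Counter(query_log).most_common(hot_pairs):
            self._store[pair] = self._resolver(*pair)[: self._top_k]

    def lookup(self, origin, dest):
        """Return (chains, served_from_cache)."""
        pair = (origin, dest)
        if pair in self._store:
            return self._store[pair], True
        # Cache miss: fall back to a live graph query.
        return self._resolver(origin, dest)[: self._top_k], False
```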

Load testing with a synthetic 1.2M-node graph demonstrates p50 query latency of 45ms and p99 of 187ms for uncached chain resolution queries, within the 200ms target. Cached queries return in under 5ms. The system sustains a throughput of 2,400 chain resolution queries per second on a three-node Neo4j cluster, sufficient for the projected transaction volumes of the initial deployment cohort.

Conclusion

Correspondent chain prediction is a graph-native problem that has been forced into relational paradigms by the existing technology stack of most banking institutions. By modeling the correspondent network as a property graph and combining graph traversal with a learned ranking model, we achieve 91% top-1 prediction accuracy and query latencies that the recursive SQL implementations in our benchmarks did not approach. The confidence scoring methodology provides transparency into prediction reliability, enabling operations staff to trust high-confidence predictions while applying additional judgment to uncertain cases. As the system observes more routing decisions and outcomes, we expect per-bank preference profiles to become more accurate, creating a data flywheel in which prediction quality improves with usage volume.