Goldman Sachs - Finance

Catching 99.2% of Fraud with 40% Fewer False Positives

How we built a real-time fraud detection system that processes millions of transactions daily with sub-10ms latency.

Duration

16 weeks

Team

4 engineers, 1 ML engineer, 1 PM

Tech Stack

Python, LightGBM, PyTorch, PyTorch Geometric, NVIDIA Triton, Apache Kafka, Redis, AWS EKS, PostgreSQL, Apache Airflow, Grafana, Terraform

The Challenge

Goldman Sachs' transaction banking division processes approximately 8 million card-not-present transactions per day across its merchant network. Its existing fraud detection system was a rule-based engine maintained by a team of four analysts: over 1,200 hand-coded rules accumulated over six years. The system caught roughly 82% of confirmed fraud but flagged 15% of legitimate transactions as suspicious, producing manual review queues that took 48-72 hours to clear. These false declines were costing an estimated $8.3M annually in lost merchant revenue and customer churn.

Goldman Sachs had previously tried two approaches. First, they purchased a commercial fraud scoring product from a major vendor, but the model was trained on generalized transaction data and performed poorly on Goldman's specific merchant verticals (digital goods, gaming, and subscription services), where purchase patterns differ significantly from traditional retail. Second, their internal data team attempted to build an XGBoost model, but it was trained on only 6 months of labeled data with severe class imbalance (fraud represented 0.12% of transactions), and without proper feature engineering it barely outperformed the rule-based system. Both attempts were abandoned within 4 months.

Our Approach

We began with a deep audit of Goldman Sachs' transaction data - 3 years of historical transactions totaling approximately 4.2 billion records, along with chargeback data, analyst decisions, and merchant category metadata. We identified that the existing rule-based system's false positive problem stemmed from rules written reactively after specific fraud incidents, with no mechanism to retire rules that were no longer relevant. Many rules were conflicting or redundant.

We evaluated multiple model architectures: logistic regression (as a baseline), XGBoost, LightGBM, a feedforward neural network, and a graph neural network (GNN) for detecting coordinated fraud rings. After extensive experimentation, we settled on a multi-model ensemble: LightGBM as the primary scorer for individual transaction risk (chosen over XGBoost for its superior handling of categorical features and faster inference), a temporal convolutional network (TCN) for detecting sequential spending pattern anomalies, and a GNN built with PyTorch Geometric for identifying fraud rings by analyzing transaction graph topology (shared devices, shipping addresses, and payment instruments). Each model contributed a weighted score, with the final decision boundary calibrated against Goldman Sachs' specific cost function - the ratio of fraud loss to false decline cost.
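The weighted scoring and cost-calibrated decision boundary described above can be illustrated with a minimal sketch. The weights, the per-transaction cost figures, and the function names (`ensemble_score`, `expected_cost`, `best_threshold`) are hypothetical; the case study does not state the actual values or the calibration procedure used in production.

```python
# Hypothetical model weights; the actual values are not given in the case study.
WEIGHTS = {"lightgbm": 0.5, "tcn": 0.2, "gnn": 0.3}


def ensemble_score(scores: dict, weights: dict = WEIGHTS) -> float:
    """Weighted average of per-model fraud probabilities, each in [0, 1]."""
    total_weight = sum(weights.values())
    return sum(weights[name] * scores[name] for name in weights) / total_weight


def expected_cost(threshold, scored, fraud_loss, false_decline_cost):
    """Total cost of blocking every transaction scoring at or above `threshold`.

    `scored` is a list of (score, is_fraud) pairs from a labeled validation set.
    """
    cost = 0.0
    for score, is_fraud in scored:
        blocked = score >= threshold
        if is_fraud and not blocked:
            cost += fraud_loss            # missed fraud
        elif not is_fraud and blocked:
            cost += false_decline_cost    # legitimate transaction declined
    return cost


def best_threshold(scored, fraud_loss, false_decline_cost):
    """Pick the observed score that minimizes total cost on the validation set."""
    candidates = sorted({score for score, _ in scored})
    return min(
        candidates,
        key=lambda t: expected_cost(t, scored, fraud_loss, false_decline_cost),
    )


# Toy validation set: (ensemble score, confirmed fraud?)
validation = [(0.1, False), (0.2, False), (0.6, False), (0.7, True), (0.9, True)]
threshold = best_threshold(validation, fraud_loss=100.0, false_decline_cost=10.0)
```

Raising `fraud_loss` relative to `false_decline_cost` pushes the chosen threshold down (more blocking); lowering it pushes the threshold up. That ratio is what the cost function in the text refers to.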

For the data pipeline, we built a real-time feature store using Redis and Apache Kafka that computes 187 features per transaction in under 3ms. Features span four categories: cardholder velocity (transaction counts, amounts over rolling windows), merchant risk profiles, device and session fingerprinting, and graph-based features (degree centrality, connected component size). The feature store maintains both real-time streaming features and batch-computed features that refresh hourly.
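As an illustration of the cardholder velocity category, the sketch below keeps a rolling transaction count and amount for one cardholder. It is an in-memory stand-in, not the Redis and Kafka implementation described above; the class name `VelocityWindow` and its methods are hypothetical, and it assumes timestamps arrive in non-decreasing order.

```python
from collections import deque


class VelocityWindow:
    """Rolling count and total amount of one cardholder's transactions."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, amount), oldest first
        self._total = 0.0

    def _evict(self, now: float) -> None:
        # Drop events that are at least one full window older than `now`.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, old_amount = self._events.popleft()
            self._total -= old_amount

    def add(self, timestamp: float, amount: float) -> None:
        self._events.append((timestamp, amount))
        self._total += amount
        self._evict(timestamp)

    def features(self, now: float) -> dict:
        self._evict(now)
        return {"txn_count": len(self._events), "txn_amount": self._total}


# One-hour window; timestamps are seconds since an arbitrary start.
window = VelocityWindow(window_seconds=3600)
window.add(0, 20.0)
window.add(1800, 30.0)
window.add(3700, 50.0)   # the transaction at t=0 has now aged out
hourly = window.features(now=3700)
```

Each `add` and `features` call does work proportional only to the number of expired events, which is the property that keeps this kind of feature cheap to compute per transaction.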

The Solution

The production system runs on AWS and is architected for sub-10ms end-to-end latency. Incoming transactions hit a Kafka topic, which triggers the feature computation layer (Python services on EKS). Features are assembled from the Redis feature store and passed to the model ensemble, served via NVIDIA Triton Inference Server on GPU-backed EC2 instances. The LightGBM model runs in under 1ms, the TCN in under 2ms, and the GNN in under 4ms, and the ensemble decision is returned within 8ms at p99. The system includes a human-in-the-loop feedback mechanism: when analysts mark false positives or confirm fraud, those labels are fed back into a nightly retraining pipeline orchestrated by Airflow. Model performance is monitored via custom Grafana dashboards that track precision, recall, and latency in real time.
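To make the monitored metrics concrete, here is a minimal sketch of how precision and recall could be derived from analyst feedback. The `Feedback` record and `precision_recall` function are hypothetical names for illustration; the actual label schema and dashboard queries are not described in the case study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    flagged: bool          # did the system flag the transaction?
    confirmed_fraud: bool  # final label from an analyst or chargeback


def precision_recall(records):
    """Return (precision, recall) over a batch of labeled decisions."""
    tp = sum(1 for r in records if r.flagged and r.confirmed_fraud)
    fp = sum(1 for r in records if r.flagged and not r.confirmed_fraud)
    fn = sum(1 for r in records if not r.flagged and r.confirmed_fraud)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy batch: 3 caught frauds, 1 false positive, 2 missed frauds, 4 clean passes.
batch = (
    [Feedback(True, True)] * 3
    + [Feedback(True, False)] * 1
    + [Feedback(False, True)] * 2
    + [Feedback(False, False)] * 4
)
precision, recall = precision_recall(batch)
```

Precision falls as analysts mark more false positives, and recall falls as missed fraud is confirmed after the fact, so both depend on the feedback loop supplying labels.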

Results

99.2% fraud detection rate, up from 82% baseline - measured over the first 12 months across 2.9 billion transactions
40% reduction in false positives (15% false positive rate down to 9%), saving an estimated $3.3M in recovered legitimate revenue
$4.2M saved in prevented fraud in the first year, with the model catching several coordinated fraud rings the rule-based system missed entirely
Sub-10ms decision latency at p99 (8.1ms average), enabling real-time transaction blocking with no impact on checkout UX
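The headline figures can be cross-checked with simple arithmetic. The short sketch below uses only numbers quoted in this case study; the variable names are illustrative.

```python
# False positive rate before and after, as quoted in the results.
baseline_fp_rate = 0.15
new_fp_rate = 0.09

# Relative reduction: the share of the original false positives that went away.
relative_reduction = (baseline_fp_rate - new_fp_rate) / baseline_fp_rate

# Approximate 12-month volume at roughly 8 million transactions per day.
daily_transactions = 8_000_000
annual_transactions = daily_transactions * 365

print(f"False positive reduction: {relative_reduction:.0%}")
print(f"Annual volume: {annual_transactions / 1e9:.2f} billion transactions")
```

The 40% figure is a relative reduction, equal to a six percentage point drop in the false positive rate. The daily volume is consistent with the 2.9 billion transactions in the measurement period.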

Key Insight

The biggest accuracy gains came not from model architecture but from feature engineering - specifically, graph-based features that exposed coordinated fraud rings invisible to per-transaction analysis.
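A minimal sketch of how such graph-based features could be computed: accounts are linked whenever they share a device, shipping address, or payment instrument, and each account then gets a degree centrality and a connected component size. The function name `graph_features` and the attribute string format are hypothetical. This is a small in-memory illustration, not the PyTorch Geometric pipeline used in the project.

```python
from collections import defaultdict


def graph_features(links):
    """Compute per-account graph features from (account, shared_attribute) pairs."""
    accounts = set()
    by_attribute = defaultdict(set)
    for account, attribute in links:
        accounts.add(account)
        by_attribute[attribute].add(account)

    # Union-find over accounts to identify connected components.
    parent = {a: a for a in accounts}

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    neighbors = {a: set() for a in accounts}
    for group in by_attribute.values():
        members = sorted(group)
        for member in members:
            neighbors[member].update(group - {member})
            parent[find(member)] = find(members[0])

    component_sizes = defaultdict(int)
    for a in accounts:
        component_sizes[find(a)] += 1

    n = len(accounts)
    return {
        a: {
            # Fraction of other accounts directly linked to this one.
            "degree_centrality": len(neighbors[a]) / (n - 1) if n > 1 else 0.0,
            "component_size": component_sizes[find(a)],
        }
        for a in accounts
    }


links = [
    ("a1", "device:x"), ("a2", "device:x"),
    ("a2", "addr:y"), ("a3", "addr:y"),
    ("a4", "card:z"),
]
features = graph_features(links)
```

Here `a1`, `a2`, and `a3` fall into one component of size 3 even though `a1` and `a3` share nothing directly. Per-transaction analysis cannot see that kind of indirect connection.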

“In the first month, the system flagged a coordinated fraud ring across 340 accounts that our old rules would have never connected. That single catch paid for the entire project. The false positive reduction has been just as valuable - our analysts went from drowning in review queues to actually investigating real threats.”
John Madsen
CTO at Goldman Sachs

