Synthefy-Nori-V1 — Replaces XGBoost
New release · Synthefy-Nori-V1
The foundation model for tables. Fully open source.
Why this matters
The predictions that actually run a business — credit decisions, fraud flags, demand, pricing, churn, capacity — aren't made from prose or pixels. They're made from rows and columns. Tabular data is the most valuable data most companies own, and the hardest to get right.
Credit scoring, fraud detection, and lifetime-value prediction from customer and transaction tables.
Demand forecasting, pricing simulation, inventory planning, and scenario analysis.
Claims severity, underwriting risk, and churn — straight from policy and claims tables.
Readmission risk, cost prediction, and triage from structured clinical records.
Capacity planning, throughput forecasting, incident risk, and predictive maintenance.
Conversion, lifetime value, and propensity scoring across the funnel.
The problem
Most teams still reach for gradient boosting — XGBoost, LightGBM — and every new dataset means starting from zero. You explore the data, engineer features, pick a model, tune it, validate it, and stand up the MLOps to keep it alive. Then the data drifts and you run the entire gauntlet again.
The shift
Pretrain once, use everywhere, never train per task — that's what foundation models did for text and for images. Tables never had one. And an LLM can't fill the gap: language models reason over tokens, not over millions of numeric rows and columns. Tabular prediction needs a foundation model built for tables.
Meet Synthefy-Nori
Your labeled rows are the context, and the predictions come back in a single forward pass. The model handles preprocessing, high dimensionality, and skewed targets on its own.
Pass your training table — X_train and y_train — straight into the call as context. No gradient updates, no training loop, no knobs to turn.
A single predict() runs your rows through the model once. Missing values, redundant columns, and skewed targets are handled internally.
No validation sweep, no model-versioning sprawl. When the data drifts, you send the new rows as context — there is nothing to retrain.
This is the entire API.
    from synthefy import SynthefyNoriClient

    client = SynthefyNoriClient()
    predictions = client.predict(X_train, y_train, X_test)

Benchmark proof
Synthefy-Nori-V1 was evaluated on 96 regression datasets from three independent sources — TabArena, TALENT, and OpenML-Reg. Same train/test splits, same preprocessing, same hardware for every model. Higher R² is better.
R² measures how much of the variation in the target a model explains; 1.0 is perfect. Averaged across all 96 datasets, Synthefy-Nori-V1 leads TabPFN-3, the strongest prior tabular foundation model, in mean R², at roughly a tenth of the size.
TabArena skews toward larger, modern datasets — gradient boosting’s home turf. With zero tuning, Synthefy-Nori still wins 12 of 13 regression datasets against XGBoost and LightGBM given a full tuning budget.
6M parameters, ~22 MB on disk, versus TabPFN-3 at 58.3M — with higher mean R². The diamonds regression (16K rows) runs end to end in ~2.8 seconds on a single GPU.
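For readers who want to check the metric themselves, here is a minimal sketch of R² and of the per-dataset averaging behind the mean R² figures in this section. The function names and the toy arrays are illustrative assumptions; this is not the Synthefy benchmark harness.

```python
import numpy as np

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation in the target
    if ss_tot == 0.0:
        raise ValueError("R2 is undefined when the target is constant")
    return 1.0 - ss_res / ss_tot

def mean_r2(per_dataset):
    """Average R² over (y_true, y_pred) pairs, one pair per dataset."""
    return float(np.mean([r2_score(t, p) for t, p in per_dataset]))

# Toy targets, made up for illustration only.
y = np.array([1.0, 2.0, 3.0, 4.0])
perfect = r2_score(y, y)                            # exact predictions
baseline = r2_score(y, np.full_like(y, y.mean()))   # always predict the mean
```

Exact predictions score 1.0 and a model that always predicts the target mean scores 0.0, which is why a higher mean R² across datasets indicates more of the target explained on average.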
Teaser — Thinking Mode
Thinking Mode decides how to process each dataset before predicting — augmentations, normalizations, preprocessing — with no human in the loop. The gains land on the large, hard datasets and compound in aggregate, lifting mean R² to 0.7531.
The payoff
The months you spend on the pipeline (EDA, data engineering, feature selection, training, tuning) collapse into two calls: constructing the client and calling predict(). Here's what leaves your workflow for good:
No training loops, no learning-rate sweeps, no early-stopping callbacks. There are no knobs to turn — the model configures its own preprocessing per dataset.
Missing values, noisy labels, redundant columns, heavy tails — Synthefy-Nori was pretrained on synthetic data deliberately built to contain all of it. Hand it the raw rows.
Drift used to mean spinning up a training run. With in-context learning it means sending the new rows in as context. No retraining, no model-versioning sprawl.
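To make the drift point concrete, here is a rough sketch of what sending the new rows in as context could look like. `NearestNeighborStandIn` is a hypothetical placeholder that only mimics the `predict(X_train, y_train, X_test)` call shape; it is not Synthefy-Nori and says nothing about its accuracy. The `refresh_context` helper and its `max_rows` parameter are likewise assumptions made for illustration.

```python
import numpy as np

class NearestNeighborStandIn:
    """Hypothetical stand-in with the same call shape as client.predict().

    It copies the label of the closest context row, so like an in-context
    model it has no fit() step and keeps no state between calls.
    """

    def predict(self, X_train, y_train, X_test):
        X_train = np.asarray(X_train, dtype=float)
        y_train = np.asarray(y_train, dtype=float)
        X_test = np.asarray(X_test, dtype=float)
        # Pairwise squared distances, shape (n_test, n_train).
        d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
        return y_train[d2.argmin(axis=1)]

def refresh_context(X_ctx, y_ctx, X_new, y_new, max_rows=1000):
    """Append newly labeled rows and keep only the most recent max_rows."""
    X = np.vstack([X_ctx, X_new])[-max_rows:]
    y = np.concatenate([y_ctx, y_new])[-max_rows:]
    return X, y

client = NearestNeighborStandIn()

# Old regime: y = x. New regime after drift: y = x + 10.
X_old = np.array([[0.0], [1.0], [2.0]])
y_old = np.array([0.0, 1.0, 2.0])
X_new = np.array([[0.0], [1.0], [2.0]])
y_new = np.array([10.0, 11.0, 12.0])
X_test = np.array([[1.1]])

before = client.predict(X_old, y_old, X_test)

# Handling drift is a change of inputs, not a training run.
X_ctx, y_ctx = refresh_context(X_old, y_old, X_new, y_new, max_rows=3)
after = client.predict(X_ctx, y_ctx, X_test)
```

Swapping the stand-in for a real `SynthefyNoriClient()` should leave the rest of the code unchanged, provided the client accepts NumPy arrays, which the source does not state.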
A closer look
Across the 96 datasets, Nori and TabPFN-3 usually tie, which keeps the average margin small. But where either model has a decisive edge, Nori lands the most wins, and the largest, on real, public datasets, at a tenth of the size. And on the small-to-mid tables it’s built for, it returns predictions faster too.
Of the 11 datasets where either model has a decisive edge (an R² gap above 0.02), Nori takes 8, and by the widest margins. The standout is Job Profitability, where it lifts R² from 0.14 to 0.41, nearly tripling the explained variance. Nori also wins on socmob and sulfur under a second independent benchmark suite, so these aren’t harness flukes, and every dataset is public.
On those small-to-mid tables, Nori returns predictions in roughly a second — faster than TabPFN-3 in every size band, at 6M parameters versus 58.3M. No training run, no cluster: one library call on a single GPU. Past ~100k cells, a quick gradient-boosted model still wins — we’d rather be straight about that.
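The decisive-edge tally described above can be sketched as follows. The 0.02 threshold and the Job Profitability scores (0.14 versus 0.41) come from the text; every other number in the example is a placeholder invented to exercise the code, not a benchmark result, and `tally_decisive` is a hypothetical helper.

```python
def tally_decisive(r2_a, r2_b, threshold=0.02):
    """Count ties and decisive wins between two models.

    r2_a and r2_b map dataset name -> R² for model A and model B.
    A dataset is a tie when the absolute R² gap is at most `threshold`.
    """
    wins_a, wins_b, ties = [], [], []
    for name in r2_a:
        gap = r2_a[name] - r2_b[name]
        if abs(gap) <= threshold:
            ties.append(name)
        elif gap > 0:
            wins_a.append(name)
        else:
            wins_b.append(name)
    return wins_a, wins_b, ties

# Only the first entry uses scores quoted in the text; the rest are placeholders.
nori = {"job_profitability": 0.41, "placeholder_1": 0.80, "placeholder_2": 0.55}
other = {"job_profitability": 0.14, "placeholder_1": 0.81, "placeholder_2": 0.60}

wins_nori, wins_other, ties = tally_decisive(nori, other)
```

Reporting ties separately from decisive wins is what lets a small average margin and a lopsided win count both be true at once.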
Fully open source — Apache 2.0
Code on GitHub, weights on Hugging Face — free and open source. Point it at the data you already have and see what it predicts.