Recursive Model Improvement via
Active Learning in Non-Life Pricing

Python trains LightGBM, validated by SHAP, explored via Streamlit.

A simulation of how an insurer reverse-engineers a competitor's pricing engine by scraping aggregator quotes — and how active learning makes that scraping strategy as efficient as possible.

David Fischer · 2026

Python 3.12 · LightGBM · SHAP · Active learning (11 strategies) · Streamlit dashboard · Motor insurance (105 K policies) · Lledó & Pavía (2024)

The problem

Insurers selling on aggregator platforms like comparis.ch can observe competitor quotes in real time. By systematically submitting policy profiles and recording the returned premiums, they can train a competitor model — a replica of the competitor's pricing engine.

In practice, scraping budgets are limited: every profile query costs time and risks detection. The question is not whether to scrape, but which profiles to query next to learn the tariff structure as efficiently as possible. This is an active learning problem.

Core research questions. (1) Does an active learning query strategy rediscover systematic ceteris paribus profiling on its own — varying one factor at a time while holding all others fixed? (2) Do Gaussian joint perturbations — varying all features simultaneously around an anchor — outperform CP sweeps by exposing LightGBM to genuine multivariate variation within each anchor's batch?
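The two generator families behind these questions can be sketched as follows. The anchor dict, feature names, and noise scales below are illustrative stand-ins, not the project's actual schema:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical anchor: one policy profile as a dict of numeric features.
anchor = {"driver_age": 45.0, "Power": 90.0, "vehicle_age": 6.0}

def cp_sweep(anchor, feature, grid):
    """Ceteris paribus: vary one feature along a grid, hold all others fixed."""
    return [{**anchor, feature: v} for v in grid]

def gaussian_batch(anchor, scales, n):
    """Gaussian joint perturbation: jitter every feature at once around the anchor."""
    return [
        {k: v + rng.normal(0.0, scales[k]) for k, v in anchor.items()}
        for _ in range(n)
    ]

cp = cp_sweep(anchor, "driver_age", np.linspace(18, 80, 5))
gauss = gaussian_batch(anchor, {"driver_age": 5.0, "Power": 15.0, "vehicle_age": 2.0}, 5)
```

The CP batch carries no joint variation at all (every non-swept feature is constant), while each Gaussian batch exposes the learner to multivariate variation around the anchor's natural feature context.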

Architecture

Real data → train oracle → oracle + warm start → simulate aggregator → AL loop (query & retrain) → Streamlit (compare strategies)

Phase 1 — Oracle

A LightGBM model is trained on Premium ~ features using all renewal rows from the Lledó & Pavía (2024) motor insurance dataset (105,555 rows, 53,502 unique policies). This becomes the oracle: given any policy profile, it returns a simulated competitor quote.

The oracle is a memorisation task, not a predictive model — there is no train/test split. In-sample R² measures how well it has learned the tariff surface.

| Feature | Description |
|---|---|
| driver_age | Age of primary driver at renewal (engineered from date of birth) |
| licence_age | Years since licence was issued |
| vehicle_age | Age of vehicle at renewal |
| Power, Cylinder_capacity | Engine power (hp) and cylinder capacity |
| Value_vehicle | Market value of the vehicle |
| Seniority | Customer tenure with the insurer |
| Area, Type_risk, Type_fuel | Categorical risk factors |
| Distribution_channel, Payment | Contract administration features |
| Cost_claims_year, N_claims_year | Excluded — current-year outcomes, not observable at quote time |
| N_claims_history, R_Claims_history | Excluded — scraping is done with claim history set to 0 on aggregators |

Phase 2 — Active learning loop

The competitor model is seeded with a warm start of ~5,000 real policy rows (approximately one week's scraping budget), simulating organic quote requests arriving via the aggregator before the systematic AL loop begins.

The loop then runs weekly. Two profile generators are compared: ceteris paribus (CP) sweeps, which vary one feature at a time around an anchor, and Gaussian joint perturbations, which vary all features simultaneously around the anchor.

Each week, a pool of n_anchors_base × anchor_space_multiplier candidates is scored, the top selection_fraction are profiled, and the resulting profiles are sampled to the weekly budget. Budget converts to base anchors as n_anchors_base = 5 000 ÷ 254 ≈ 19; with defaults anchor_space_multiplier = 30 and selection_fraction = 10% this means 570 candidates scored → 57 anchors profiled → 5 000 profiles labeled.
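The sizing arithmetic above as a short sketch (interpreting the 254 divisor as the number of profiles generated per anchor, which is an assumption):

```python
# Weekly pool sizing under the default parameters described above.
weekly_budget = 5_000            # labeled profiles per week
profiles_per_anchor = 254        # assumed: profiles generated around each anchor
anchor_space_multiplier = 30
selection_fraction = 0.10

n_anchors_base = weekly_budget // profiles_per_anchor           # 19
candidates_scored = n_anchors_base * anchor_space_multiplier    # 570
anchors_profiled = round(candidates_scored * selection_fraction)  # 57

print(n_anchors_base, candidates_scored, anchors_profiled)
```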

| AL strategy | Variants | Query criterion |
|---|---|---|
| Random | _cp · _gauss | Uniform random anchor selection — no model required |
| Random market | — | 90% real portfolio rows + 10% CP profiles from random anchors; models the portfolio coverage gap in aggregator traffic |
| Uncertainty | _cp · _gauss | Anchors with highest bootstrap prediction variance across ensemble members |
| Error-based | _cp · _gauss | Anchors with highest expected relative error, estimated by a proxy model trained on labeled residuals |
| Segment-adaptive | _cp · _gauss | Anchors scored by global + per-segment relative RMSE on the labeled set; converges toward random as segment gaps close |
| Disruption-adaptive | _cp · _gauss | Concentrates budget on segments with a sharp week-on-week RMSE increase; reverts to global random when no disruption is detected |
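As one example, the uncertainty criterion amounts to ranking anchors by ensemble disagreement. A minimal sketch, where the "ensemble" is a toy stand-in for bootstrap-trained LightGBM models:

```python
import numpy as np

def uncertainty_scores(ensemble, candidates):
    """Score each candidate anchor by prediction variance across
    bootstrap ensemble members (higher variance = more informative)."""
    preds = np.stack([m(candidates) for m in ensemble])  # (n_models, n_candidates)
    return preds.var(axis=0)

# Toy ensemble: three 'models' that disagree more for larger inputs.
ensemble = [lambda x, k=k: x * (1.0 + 0.1 * k * x) for k in range(3)]
candidates = np.array([0.1, 1.0, 5.0])

scores = uncertainty_scores(ensemble, candidates)
top = candidates[np.argsort(scores)[::-1]]  # query highest-variance anchors first
```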

Segment-level RMSE is tracked alongside global RMSE across all weeks. Four commercially motivated segments are defined, each covering roughly 10% of the Spanish portfolio:

| Segment | Threshold | % of portfolio | Rows in holdout |
|---|---|---|---|
| Young drivers | driver_age < 30 | 8.8% | ~440 |
| High-value cars | Value_vehicle > €28 000 | 11.8% | ~590 |
| High-power cars | Power > 130 hp | 10.9% | ~546 |
| Senior drivers | driver_age ≥ 65 | 9.5% | ~475 |

Convergence is tracked in two complementary metrics. RMSE on holdout — a fixed set of 5,000 real rows, oracle-labeled, never used during training — measures prediction accuracy on a population-representative sample. SHAP cosine similarity is a simulation-only diagnostic that compares the competitor model's SHAP vectors to the oracle's, capturing whether the tariff structure has been recovered, not just the premium levels. This metric requires oracle access and cannot be observed in real-world deployment.
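The SHAP cosine similarity diagnostic reduces to comparing per-row attribution vectors. A minimal sketch, assuming both models' SHAP matrices (rows × features, e.g. from `shap.TreeExplainer`) have already been computed; the function name is illustrative:

```python
import numpy as np

def shap_cosine_similarity(shap_oracle, shap_model):
    """Mean cosine similarity between per-row SHAP vectors of the oracle
    and the competitor model (1.0 = identical attribution structure)."""
    num = (shap_oracle * shap_model).sum(axis=1)
    den = (np.linalg.norm(shap_oracle, axis=1)
           * np.linalg.norm(shap_model, axis=1))
    return float((num / den).mean())

# Toy example: attributions that agree in direction but not magnitude
# still score 1.0 — the metric captures structure, not premium levels.
a = np.array([[1.0, 2.0, -0.5], [0.3, -1.0, 0.2]])
b = 2.0 * a
print(shap_cosine_similarity(a, b))  # → 1.0
```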

Tariff change simulation

A PerturbedOracleEngine can be injected at one or more configurable weeks within a single simulation run — for example, a young-driver surcharge of +20% at week 3 followed by area repricing at week 7. Multiple shocks are chained in a single continuous timeline; holdout labels switch at each event so the RMSE curve always measures recovery of the currently active tariff.

Simulations and perturbation types are fully defined in YAML configuration files. The perturbation library (tariff_changes.yaml) holds named definitions — young-driver surcharge, high-value surcharge, uniform reprice, area repricing, and composed stacked shocks — which are referenced by name from each simulation's schedule in simulation.yaml. A schedule entry can list multiple perturbation names to apply them simultaneously at the same week (e.g. high-value surcharge and young-driver surcharge both at week 4), or spread across different weeks for sequential multi-wave shocks. Adding a new scenario requires no code changes.
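An illustrative sketch of what such a configuration pair might look like. The keys and perturbation names below are hypothetical, modeled on the perturbations listed above, not the project's actual schema:

```yaml
# tariff_changes.yaml — named perturbation definitions (illustrative)
young_driver_surcharge:
  segment: "driver_age < 30"
  factor: 1.20            # +20% on the affected segment

area_repricing:
  segment: "Area == 'urban'"
  factor: 1.10

# simulation.yaml — a schedule referencing perturbations by name
schedule:
  - week: 3
    perturbations: [young_driver_surcharge]
  - week: 7
    perturbations: [area_repricing]
```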

This lets practitioners answer a critical operational question: is the weekly continuous scraping rate sufficient to track a tariff change, or does the model need a full restart with a fresh bulk scrape?

Oracle — validation results

| Metric | Value | Interpretation |
|---|---|---|
| In-sample RMSE | 64.10 | Root-mean-square error ~€64 on premiums averaging ~€316 |
| In-sample R² | 0.793 | ~10% of variance is irreducible within-policy noise |
| Theoretical R² ceiling | ~0.90 | Same policy repriced across years with unobservable factors |

SHAP validation — key findings

driver_age. Strong U-shaped effect: SHAP values are highest (+50 to +180) for drivers aged 18–25, decay sharply to negative territory by age 35–40, and recover slightly for older drivers. The elderly uptick is muted — the dataset has few policies above age 70. Actuarially sensible.
driver_age × Power interaction. Young drivers (18–25) with high-powered vehicles show amplified SHAP values — the classic high-risk combination. The interaction dissolves by age 35. The oracle has learned this structure from the data without any explicit modelling.
Known limitation. Claim history (N_claims_history, R_Claims_history) is excluded from the oracle. In practice, scraping is performed with claim history set to 0 on aggregators, so this matches real-world scraping behaviour — but it means bonus-malus effects are not captured.

Active learning results

Simulation run: 10 weeks · 5 000 profiles/week · 11 strategies (5 CP variants, 5 Gaussian variants, random market). Each week, 570 candidate anchors are scored, the top 10% (57) are profiled, and profiles are sampled to the weekly budget of 5 000.

Global convergence

Random market outperforms all CP-based strategies. Labeling real portfolio rows, augmented with a small share of ceteris-paribus profiles, produces training data with natural feature correlations across all variables simultaneously. LightGBM learns interaction effects far more efficiently from these genuine multivariate profiles than from CP sweeps, which vary one feature at a time while holding all others fixed. This challenges the assumption that systematic ceteris-paribus profiling is the optimal data-collection strategy for competitor model building.
Why: CP profiles are structurally limited. A CP profile sweeping driver age from 18 to 80 holds Power, vehicle age, and every other feature at a single anchor value. The resulting training data covers marginal tariff curves well but systematically under-represents the multivariate interactions that drive pricing variation. Real observed quotes have no such constraint — they carry the full joint distribution of risk factors.
Among CP strategies, random anchor sampling is competitive. On a population-representative holdout, uniform random anchor selection matches or outperforms all informativeness-based CP strategies across 10 weeks on both RMSE and SHAP cosine similarity.
Error-based wins on young drivers. This is the one segment where residuals are systematically large early in the run, giving error_based a clear signal to concentrate budget. The effect is commercially relevant — young-driver pricing is one of the most sensitive and frequently debated segments in motor insurance.

Why sophisticated CP strategies underperform globally

Greedy informativeness strategies concentrate scraping budget on high-signal edge cases — young drivers, high-powered vehicles, extreme vehicle values — at the expense of mainstream segments. Random sampling, by contrast, draws anchors proportional to the real data distribution, which naturally matches a population-representative holdout.

Tariff change: restart is not always optimal

After a targeted tariff change (e.g. young-driver surcharge +20%), a full restart discards all accumulated labels — including valid ones from unchanged segments. The continuous scraping strategy retains those labels and can achieve lower global RMSE at week 10 than a restart strategy, even though its labels are partially stale.

Disruption-adaptive: the principled alternative. Rather than discarding labels, the disruption strategy detects which segments spiked in RMSE week-on-week and concentrates that week's budget there. It fires exactly when needed, reverts to global random once the gap closes, and never discards valid labeled data from unchanged segments.
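The detection step can be sketched as a simple week-on-week ratio test. The function name, threshold value, and history format below are illustrative assumptions, not the project's actual implementation:

```python
def detect_disrupted_segments(rmse_by_week, threshold=1.25):
    """Flag segments whose latest RMSE jumped by more than `threshold`×
    week-on-week; budget is concentrated there, else global random."""
    disrupted = []
    for segment, series in rmse_by_week.items():
        if len(series) >= 2 and series[-1] > threshold * series[-2]:
            disrupted.append(segment)
    return disrupted

# Toy history: young drivers spike after a +20% surcharge, others stay flat.
history = {
    "young_drivers":  [40.0, 38.0, 62.0],
    "high_value":     [45.0, 44.0, 43.5],
    "senior_drivers": [39.0, 40.0, 39.5],
}
print(detect_disrupted_segments(history))  # → ['young_drivers']
```

Once the flagged segment's RMSE falls back under the threshold, the list comes back empty and the strategy reverts to global random sampling.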

Gaussian perturbations vs. ceteris-paribus profiles

A second research axis tests whether varying all features simultaneously — rather than one at a time — produces training data that LightGBM can learn from more efficiently. Gaussian profiles keep each anchor's batch near its natural feature context (via anchor-centred noise) while exposing the model to genuine joint-feature variation, which CP sweeps systematically suppress. Results for this comparison are pending a full re-run with the corrected anchor pool sizing.

Explore the project

| Resource | Description |
|---|---|
| GitHub repository | Full source: oracle, AL loop, Streamlit dashboard |
| Lledó & Pavía (2024) | Dataset of an actual motor vehicle insurance portfolio, Mendeley Data V2 |