Recursive Model Improvement via
Active Learning in Non-Life Pricing

Python trains LightGBM, validated by SHAP, explored via Streamlit.

A simulation of how an insurer reverse-engineers a competitor's pricing engine by scraping aggregator quotes — and how active learning makes that scraping strategy as efficient as possible.

David Fischer · 2026

Python 3.12 · LightGBM · SHAP · Active learning (11 strategies) · Streamlit dashboard · Motor insurance (105 K policies) · Lledó & Pavía (2024)

The problem

Insurers selling on aggregator platforms like comparis.ch can observe competitor quotes in real time. By systematically submitting policy profiles and recording the returned premiums, they can train a competitor model — a replica of the competitor's pricing engine.

In practice, scraping budgets are limited: every profile query costs time and risks detection. The question is not whether to scrape, but which profiles to query next to learn the tariff structure as efficiently as possible. This is an active learning problem.

Core research questions. (1) Does an active learning query strategy rediscover systematic ceteris paribus profiling on its own — varying one factor at a time while holding all others fixed? (2) Do Gaussian joint perturbations — varying all features simultaneously around an anchor — outperform CP sweeps by exposing LightGBM to genuine multivariate variation within each anchor's batch?
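The two generator families behind these questions can be sketched as follows. The anchor dict, feature names, and noise scales below are illustrative stand-ins, not the project's actual schema:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical anchor: one policy profile as a dict of numeric features.
anchor = {"driver_age": 45.0, "Power": 90.0, "vehicle_age": 6.0}

def cp_sweep(anchor, feature, grid):
    """Ceteris paribus: vary one feature along a grid, hold all others fixed."""
    return [{**anchor, feature: v} for v in grid]

def gaussian_batch(anchor, scales, n):
    """Gaussian joint perturbation: jitter every feature at once around the anchor."""
    return [
        {k: v + rng.normal(0.0, scales[k]) for k, v in anchor.items()}
        for _ in range(n)
    ]

cp = cp_sweep(anchor, "driver_age", np.linspace(18, 80, 5))
gauss = gaussian_batch(anchor, {"driver_age": 5.0, "Power": 15.0, "vehicle_age": 2.0}, 5)
```

The CP batch carries no joint variation at all (every non-swept feature is constant), while each Gaussian batch exposes the learner to multivariate variation around the anchor's natural feature context.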

Architecture

Real data → train oracle → oracle + warm start → simulate aggregator → AL loop (query & retrain) → Streamlit (compare strategies)

Phase 1 — Oracle

A LightGBM model is trained on Premium ~ features using all renewal rows from the Lledó & Pavía (2024) motor insurance dataset (105,555 rows, 53,502 unique policies). This becomes the oracle: given any policy profile, it returns a simulated competitor quote.

The oracle is a memorisation task, not a predictive model — there is no train/test split. In-sample R² measures how well it has learned the tariff surface.

| Feature | Description |
|---|---|
| driver_age | Age of primary driver at renewal (engineered from date of birth) |
| licence_age | Years since licence was issued |
| vehicle_age | Age of vehicle at renewal |
| Power, Cylinder_capacity | Engine power (hp) and cylinder capacity |
| Value_vehicle | Market value of the vehicle |
| Seniority | Customer tenure with the insurer |
| Area, Type_risk, Type_fuel | Categorical risk factors |
| Distribution_channel, Payment | Contract administration features |
| Cost_claims_year, N_claims_year | Excluded — current-year outcomes, not observable at quote time |
| N_claims_history, R_Claims_history | Excluded — scraping is done with claim history set to 0 on aggregators |

Phase 2 — Active learning loop

The competitor model is seeded with a warm start of ~5,000 real policy rows (approximately one week's scraping budget), simulating organic quote requests arriving via the aggregator before the systematic AL loop begins.

The loop then runs weekly. Two profile generators are compared: ceteris paribus (CP) sweeps, which vary one feature at a time around an anchor, and Gaussian joint perturbations, which vary all features simultaneously around the anchor.

Each week, a pool of n_anchors_base × anchor_space_multiplier candidates is scored, the top selection_fraction are profiled, and the resulting profiles are sampled to the weekly budget. Budget converts to base anchors as n_anchors_base = 5 000 ÷ 254 ≈ 19; with defaults anchor_space_multiplier = 30 and selection_fraction = 10% this means 570 candidates scored → 57 anchors profiled → 5 000 profiles labeled.
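The sizing arithmetic above as a short sketch (interpreting the 254 divisor as the number of profiles generated per anchor, which is an assumption):

```python
# Weekly pool sizing under the default parameters described above.
weekly_budget = 5_000            # labeled profiles per week
profiles_per_anchor = 254        # assumed: profiles generated around each anchor
anchor_space_multiplier = 30
selection_fraction = 0.10

n_anchors_base = weekly_budget // profiles_per_anchor           # 19
candidates_scored = n_anchors_base * anchor_space_multiplier    # 570
anchors_profiled = round(candidates_scored * selection_fraction)  # 57

print(n_anchors_base, candidates_scored, anchors_profiled)
```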

| AL strategy | Variants | Query criterion |
|---|---|---|
| Random | _cp · _gauss | Uniform random anchor selection — no model required |
| Random market | — | 90% real portfolio rows + 10% CP profiles from random anchors; models the portfolio coverage gap in aggregator traffic |
| Uncertainty | _cp · _gauss | Anchors with highest bootstrap prediction variance across ensemble members |
| Error-based | _cp · _gauss | Anchors with highest expected relative error, estimated by a proxy model trained on labeled residuals |
| Segment-adaptive | _cp · _gauss | Anchors scored by global + per-segment relative RMSE on the labeled set; converges toward random as segment gaps close |
| Disruption-adaptive | _cp · _gauss | Concentrates budget on segments with a sharp week-on-week RMSE increase; reverts to global random when no disruption is detected |
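As one example, the uncertainty criterion amounts to ranking anchors by ensemble disagreement. A minimal sketch, where the "ensemble" is a toy stand-in for bootstrap-trained LightGBM models:

```python
import numpy as np

def uncertainty_scores(ensemble, candidates):
    """Score each candidate anchor by prediction variance across
    bootstrap ensemble members (higher variance = more informative)."""
    preds = np.stack([m(candidates) for m in ensemble])  # (n_models, n_candidates)
    return preds.var(axis=0)

# Toy ensemble: three 'models' that disagree more for larger inputs.
ensemble = [lambda x, k=k: x * (1.0 + 0.1 * k * x) for k in range(3)]
candidates = np.array([0.1, 1.0, 5.0])

scores = uncertainty_scores(ensemble, candidates)
top = candidates[np.argsort(scores)[::-1]]  # query highest-variance anchors first
```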

Segment-level RMSE is tracked alongside global RMSE across all weeks. Four commercially motivated segments are defined, each covering roughly 10% of the Spanish portfolio:

| Segment | Threshold | % of portfolio | Rows in holdout |
|---|---|---|---|
| Young drivers | driver_age < 30 | 8.8% | ~440 |
| High-value cars | Value_vehicle > €28 000 | 11.8% | ~590 |
| High-power cars | Power > 130 hp | 10.9% | ~546 |
| Senior drivers | driver_age ≥ 65 | 9.5% | ~475 |

Convergence is tracked in two complementary metrics. RMSE on holdout — a fixed set of 5,000 real rows, oracle-labeled, never used during training — measures prediction accuracy on a population-representative sample. SHAP cosine similarity is a simulation-only diagnostic that compares the competitor model's SHAP vectors to the oracle's, capturing whether the tariff structure has been recovered, not just the premium levels. This metric requires oracle access and cannot be observed in real-world deployment.
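The SHAP cosine similarity diagnostic reduces to comparing per-row attribution vectors. A minimal sketch, assuming both models' SHAP matrices (rows × features, e.g. from `shap.TreeExplainer`) have already been computed; the function name is illustrative:

```python
import numpy as np

def shap_cosine_similarity(shap_oracle, shap_model):
    """Mean cosine similarity between per-row SHAP vectors of the oracle
    and the competitor model (1.0 = identical attribution structure)."""
    num = (shap_oracle * shap_model).sum(axis=1)
    den = (np.linalg.norm(shap_oracle, axis=1)
           * np.linalg.norm(shap_model, axis=1))
    return float((num / den).mean())

# Toy example: attributions that agree in direction but not magnitude
# still score 1.0 — the metric captures structure, not premium levels.
a = np.array([[1.0, 2.0, -0.5], [0.3, -1.0, 0.2]])
b = 2.0 * a
print(shap_cosine_similarity(a, b))  # → 1.0
```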

Tariff change simulation

A PerturbedOracleEngine can be injected at one or more configurable weeks within a single simulation run — for example, a young-driver surcharge of +20% at week 3 followed by area repricing at week 7. Multiple shocks are chained in a single continuous timeline; holdout labels switch at each event so the RMSE curve always measures recovery of the currently active tariff.

Simulations and perturbation types are fully defined in YAML configuration files. The perturbation library (tariff_changes.yaml) holds named definitions — young-driver surcharge, high-value surcharge, uniform reprice, area repricing, and composed stacked shocks — which are referenced by name from each simulation's schedule in simulation.yaml. A schedule entry can list multiple perturbation names to apply them simultaneously at the same week (e.g. high-value surcharge and young-driver surcharge both at week 4), or spread across different weeks for sequential multi-wave shocks. Adding a new scenario requires no code changes.
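An illustrative sketch of what such a configuration pair might look like. The keys and perturbation names below are hypothetical, modeled on the perturbations listed above, not the project's actual schema:

```yaml
# tariff_changes.yaml — named perturbation definitions (illustrative)
young_driver_surcharge:
  segment: "driver_age < 30"
  factor: 1.20            # +20% on the affected segment

area_repricing:
  segment: "Area == 'urban'"
  factor: 1.10

# simulation.yaml — a schedule referencing perturbations by name
schedule:
  - week: 3
    perturbations: [young_driver_surcharge]
  - week: 7
    perturbations: [area_repricing]
```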

This lets practitioners answer a critical operational question: is the weekly continuous scraping rate sufficient to track a tariff change, or does the model need a full restart with a fresh bulk scrape?

Oracle — validation results

| Metric | Value | Interpretation |
|---|---|---|
| In-sample RMSE | 64.10 | Root-mean-square error ~€64 on premiums averaging ~€316 |
| In-sample R² | 0.793 | ~10% of variance is irreducible within-policy noise |
| Theoretical R² ceiling | ~0.90 | Same policy repriced across years with unobservable factors |

SHAP validation — key findings

driver_age. Strong U-shaped effect: SHAP values are highest (+50 to +180) for drivers aged 18–25, decay sharply to negative territory by age 35–40, and recover slightly for older drivers. The elderly uptick is muted — the dataset has few policies above age 70. Actuarially sensible.
driver_age × Power interaction. Young drivers (18–25) with high-powered vehicles show amplified SHAP values — the classic high-risk combination. The interaction dissolves by age 35. The oracle has learned this structure from the data without any explicit modelling.
Known limitation. Claim history (N_claims_history, R_Claims_history) is excluded from the oracle. In practice, scraping is performed with claim history set to 0 on aggregators, so this matches real-world scraping behaviour — but it means bonus-malus effects are not captured.

Active learning results

Simulation run: 10 weeks · 5 000 profiles/week · 11 strategies (5 CP variants, 5 Gaussian variants, random market). Each week, 570 candidate anchors are scored, the top 10% (57) are profiled, and profiles are sampled to the weekly budget of 5 000.

Global convergence

Random market outperforms all CP-based strategies. Labeling real portfolio rows, augmented with a small share of ceteris-paribus profiles, produces training data with natural feature correlations across all variables simultaneously. LightGBM learns interaction effects far more efficiently from these genuine multivariate profiles than from CP sweeps, which vary one feature at a time while holding all others fixed. This challenges the assumption that systematic ceteris-paribus profiling is the optimal data-collection strategy for competitor model building.
Why: CP profiles are structurally limited. A CP profile sweeping driver age from 18 to 80 holds Power, vehicle age, and every other feature at a single anchor value. The resulting training data covers marginal tariff curves well but systematically under-represents the multivariate interactions that drive pricing variation. Real observed quotes have no such constraint — they carry the full joint distribution of risk factors.
Among CP strategies, random anchor sampling is competitive. On a population-representative holdout, uniform random anchor selection matches or outperforms all informativeness-based CP strategies across 10 weeks on both RMSE and SHAP cosine similarity.
Error-based wins on young drivers. This is the one segment where residuals are systematically large early in the run, giving error_based a clear signal to concentrate budget. The effect is commercially relevant — young-driver pricing is one of the most sensitive and frequently debated segments in motor insurance.

Why sophisticated CP strategies underperform globally

Greedy informativeness strategies concentrate scraping budget on high-signal edge cases — young drivers, high-powered vehicles, extreme vehicle values — at the expense of mainstream segments. Random sampling, by contrast, draws anchors proportional to the real data distribution, which naturally matches a population-representative holdout.

Tariff change: restart is not always optimal

After a targeted tariff change (e.g. young-driver surcharge +20%), a full restart discards all accumulated labels — including valid ones from unchanged segments. The continuous scraping strategy retains those labels and can achieve lower global RMSE at week 10 than a restart strategy, even though its labels are partially stale.

Disruption-adaptive: the principled alternative. Rather than discarding labels, the disruption strategy detects which segments spiked in RMSE week-on-week and concentrates that week's budget there. It fires exactly when needed, reverts to global random once the gap closes, and never discards valid labeled data from unchanged segments.
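The detection step can be sketched as a simple week-on-week ratio test. The function name, threshold value, and history format below are illustrative assumptions, not the project's actual implementation:

```python
def detect_disrupted_segments(rmse_by_week, threshold=1.25):
    """Flag segments whose latest RMSE jumped by more than `threshold`×
    week-on-week; budget is concentrated there, else global random."""
    disrupted = []
    for segment, series in rmse_by_week.items():
        if len(series) >= 2 and series[-1] > threshold * series[-2]:
            disrupted.append(segment)
    return disrupted

# Toy history: young drivers spike after a +20% surcharge, others stay flat.
history = {
    "young_drivers":  [40.0, 38.0, 62.0],
    "high_value":     [45.0, 44.0, 43.5],
    "senior_drivers": [39.0, 40.0, 39.5],
}
print(detect_disrupted_segments(history))  # → ['young_drivers']
```

Once the flagged segment's RMSE falls back under the threshold, the list comes back empty and the strategy reverts to global random sampling.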

Gaussian perturbations vs. ceteris-paribus profiles

A second research axis tests whether varying all features simultaneously — rather than one at a time — produces training data that LightGBM can learn from more efficiently. Gaussian profiles keep each anchor's batch near its natural feature context (via anchor-centred noise) while exposing the model to genuine joint-feature variation, which CP sweeps systematically suppress. Results for this comparison are pending a full re-run with the corrected anchor pool sizing.

Explore the project

| Resource | Description |
|---|---|
| GitHub repository | Full source: oracle, AL loop, Streamlit dashboard |
| Lledó & Pavía (2024) | Dataset of an actual motor vehicle insurance portfolio, Mendeley Data V2 |