The problem
Insurers selling on aggregator platforms like comparis.ch can observe competitor quotes in real time. By systematically submitting policy profiles and recording the returned premiums, they can train a competitor model — a replica of the competitor's pricing engine.
In practice, scraping budgets are limited: every profile query costs time and risks detection. The question is not whether to scrape, but which profiles to query next to learn the tariff structure as efficiently as possible. This is an active learning problem.
Architecture
Phase 1 — Oracle
A LightGBM model is trained on Premium ~ features using all renewal rows from the Lledó & Pavía (2024) motor insurance dataset (105,555 rows, 53,502 unique policies). This becomes the oracle: given any policy profile, it returns a simulated competitor quote.
The oracle is a memorisation task, not a predictive model — there is no train/test split. In-sample R² measures how well it has learned the tariff surface.
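Since there is no split, the in-sample metrics are simply the ordinary RMSE and R² evaluated on the training rows themselves. A minimal sketch (the function name is illustrative, not from the project):

```python
import numpy as np

def in_sample_fit_metrics(y_true, y_pred):
    """In-sample RMSE and R-squared, computed on the training rows by design:
    the oracle is judged on how well it memorises the tariff surface."""
    resid = y_true - y_pred
    rmse = float(np.sqrt(np.mean(resid ** 2)))
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    return rmse, 1.0 - ss_res / ss_tot

# Toy check: a perfect memoriser scores RMSE 0 and R-squared 1.
y = np.array([300.0, 320.0, 280.0, 350.0])
rmse, r2 = in_sample_fit_metrics(y, y.copy())
```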
| Feature | Description |
|---|---|
| driver_age | Age of primary driver at renewal (engineered from date of birth) |
| licence_age | Years since licence was issued |
| vehicle_age | Age of vehicle at renewal |
| Power, Cylinder_capacity | Engine power (hp) and cylinder capacity |
| Value_vehicle | Market value of the vehicle |
| Seniority | Customer tenure with the insurer |
| Area, Type_risk, Type_fuel | Categorical risk factors |
| Distribution_channel, Payment | Contract administration features |
| Cost_claims_year, N_claims_year | Excluded — current-year outcomes, not observable at quote time |
| N_claims_history, R_Claims_history | Excluded — scraping is done with claim history set to 0 on aggregators |
Phase 2 — Active learning loop
The competitor model is seeded with a warm start of ~5,000 real policy rows (approximately one week's scraping budget), simulating organic quote requests arriving via the aggregator before the systematic AL loop begins.
The loop then runs weekly. Two profile generators are compared:
- Ceteris-paribus (CP): each continuous feature is swept one at a time across its full range, all other features held fixed. 254 profiles per anchor. Mirrors standard scraping practice.
- Gaussian perturbations: all continuous features are perturbed simultaneously with independent Gaussian noise. Same 254-profile budget. Tests whether joint variation enables faster interaction learning.
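The two generators can be sketched as follows (feature names are taken from the table above; the function names, grid, and noise scales are illustrative assumptions, not the project's API):

```python
import numpy as np

rng = np.random.default_rng(0)

def cp_profiles(anchor, feature, grid):
    """Ceteris-paribus: sweep one feature across a grid, all others fixed."""
    profiles = []
    for v in grid:
        p = dict(anchor)
        p[feature] = v
        profiles.append(p)
    return profiles

def gaussian_profiles(anchor, scales, n):
    """Perturb all continuous features simultaneously with independent
    Gaussian noise centred on the anchor's own values."""
    return [
        {f: anchor[f] + rng.normal(0.0, s) for f, s in scales.items()}
        for _ in range(n)
    ]

anchor = {"driver_age": 42.0, "Power": 90.0, "Value_vehicle": 15000.0}
cp = cp_profiles(anchor, "driver_age", np.linspace(18, 80, 10))
gauss = gaussian_profiles(
    anchor, {"driver_age": 5.0, "Power": 15.0, "Value_vehicle": 3000.0}, 10
)
```

The key difference: every CP profile shares all non-swept values with its anchor, while every Gaussian profile varies all features jointly around the anchor.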
Each week, a pool of n_anchors_base × anchor_space_multiplier candidates is scored, the top selection_fraction are profiled, and the resulting profiles are subsampled to the weekly budget. The budget converts to base anchors as n_anchors_base = 5 000 ÷ 254 ≈ 19 (rounded down); with the defaults anchor_space_multiplier = 30 and selection_fraction = 10%, this means 570 candidates scored → 57 anchors profiled → 14 478 profiles generated, subsampled to the 5 000-profile weekly budget.
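The budget arithmetic can be checked directly (variable names mirror the parameters in the text):

```python
# Weekly scraping budget converted to anchors and candidate pool size.
weekly_budget = 5_000
profiles_per_anchor = 254
anchor_space_multiplier = 30
selection_fraction = 0.10

n_anchors_base = weekly_budget // profiles_per_anchor            # 19
n_candidates = n_anchors_base * anchor_space_multiplier          # 570 scored
n_anchors_profiled = int(n_candidates * selection_fraction)      # 57 profiled
n_profiles_generated = n_anchors_profiled * profiles_per_anchor  # 14 478,
# then subsampled back down to the 5 000-profile weekly budget.
```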
| AL strategy | Variants | Query criterion |
|---|---|---|
| Random | _cp · _gauss | Uniform random anchor selection — no model required |
| Random market | — | 90% real portfolio rows + 10% CP profiles from random anchors; models portfolio coverage gap in aggregator traffic |
| Uncertainty | _cp · _gauss | Anchors with highest bootstrap prediction variance across ensemble members |
| Error-based | _cp · _gauss | Anchors with highest expected relative error, estimated by a proxy model trained on labeled residuals |
| Segment-adaptive | _cp · _gauss | Anchors scored by global + per-segment relative RMSE on the labeled set; converges toward random as segment gaps close |
| Disruption-adaptive | _cp · _gauss | Concentrates budget on segments with a sharp week-on-week RMSE increase; reverts to global random when no disruption is detected |
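To make the query criteria concrete, here is a minimal sketch of the Uncertainty strategy's bootstrap-variance scoring (function names and the ensemble array layout are assumptions for illustration):

```python
import numpy as np

def uncertainty_scores(ensemble_preds):
    """ensemble_preds: shape (n_members, n_anchors), predictions from
    bootstrap replicas of the competitor model. Score = prediction variance
    across ensemble members."""
    return ensemble_preds.var(axis=0)

def select_anchors(ensemble_preds, selection_fraction=0.10):
    """Return indices of the most uncertain anchors."""
    scores = uncertainty_scores(ensemble_preds)
    k = max(1, int(len(scores) * selection_fraction))
    return np.argsort(scores)[::-1][:k]

# Three ensemble members, three candidate anchors: the members disagree
# sharply on anchor 1, so it is selected first.
preds = np.array([[300.0, 310.0, 500.0],
                  [302.0, 405.0, 505.0],
                  [298.0, 260.0, 495.0]])
chosen = select_anchors(preds, selection_fraction=1 / 3)
```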
Segment-level RMSE is tracked alongside global RMSE across all weeks. Four commercially motivated segments are defined, each covering roughly 10% of the Spanish portfolio:
| Segment | Threshold | % of portfolio | Rows in holdout |
|---|---|---|---|
| Young drivers | driver_age < 30 | 8.8% | ~440 |
| High-value cars | Value_vehicle > €28 000 | 11.8% | ~590 |
| High-power cars | Power > 130 hp | 10.9% | ~546 |
| Senior drivers | driver_age ≥ 65 | 9.5% | ~475 |
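Segment-level tracking can be sketched with the thresholds above (the helper and segment names are illustrative, not the project's code; the feature names match the oracle's feature table):

```python
import numpy as np

# Commercial segment definitions, mirroring the thresholds in the table.
SEGMENTS = {
    "young_drivers":   lambda X: X["driver_age"] < 30,
    "high_value_cars": lambda X: X["Value_vehicle"] > 28_000,
    "high_power_cars": lambda X: X["Power"] > 130,
    "senior_drivers":  lambda X: X["driver_age"] >= 65,
}

def segment_rmse(X, y_true, y_pred):
    """RMSE restricted to each segment's rows of the holdout."""
    out = {}
    for name, mask_fn in SEGMENTS.items():
        m = mask_fn(X)
        if m.any():
            out[name] = float(np.sqrt(np.mean((y_true[m] - y_pred[m]) ** 2)))
    return out

# Tiny worked example: three holdout rows, one per illustrative profile.
X = {"driver_age":    np.array([22, 70, 40]),
     "Value_vehicle": np.array([10_000, 30_000, 15_000]),
     "Power":         np.array([80, 140, 100])}
y_true = np.array([400.0, 300.0, 250.0])
y_pred = np.array([380.0, 310.0, 250.0])
res = segment_rmse(X, y_true, y_pred)
```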
Convergence is tracked in two complementary metrics. RMSE on holdout — a fixed set of 5,000 real rows, oracle-labeled, never used during training — measures prediction accuracy on a population-representative sample. SHAP cosine similarity is a simulation-only diagnostic that compares the competitor model's SHAP vectors to the oracle's, capturing whether the tariff structure has been recovered, not just the premium levels. This metric requires oracle access and cannot be observed in real-world deployment.
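One plausible reading of the SHAP diagnostic, assuming per-feature attribution vectors (e.g. mean |SHAP| per feature) have already been computed for both models; the project may aggregate SHAP values differently:

```python
import numpy as np

def shap_cosine_similarity(shap_oracle, shap_competitor):
    """Cosine similarity between per-feature SHAP attribution vectors of
    the oracle and the competitor model. 1.0 means identical relative
    feature importances (tariff structure recovered); values near 0 mean
    the models price off different features."""
    a = np.asarray(shap_oracle, dtype=float)
    b = np.asarray(shap_competitor, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Because cosine similarity is scale-invariant, it measures whether the two models weight features the same way, independently of overall premium level.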
Tariff change simulation
A PerturbedOracleEngine can be injected at one or more configurable
weeks within a single simulation run — for example, a young-driver surcharge of +20%
at week 3 followed by area repricing at week 7. Multiple shocks are chained in a
single continuous timeline; holdout labels switch at each event so the RMSE curve
always measures recovery of the currently active tariff.
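A stripped-down sketch of the idea behind a perturbed oracle (class and parameter names here are illustrative; the project's PerturbedOracleEngine additionally supports chained, scheduled shocks):

```python
class PerturbedOracle:
    """Wraps a base quoting function and applies a multiplicative
    surcharge to any profile matching a segment predicate."""

    def __init__(self, base_quote_fn, mask_fn, factor):
        self.base_quote_fn = base_quote_fn  # original tariff
        self.mask_fn = mask_fn              # segment predicate
        self.factor = factor                # e.g. 1.20 for +20%

    def quote(self, profile):
        q = self.base_quote_fn(profile)
        return q * self.factor if self.mask_fn(profile) else q

# Young-driver surcharge of +20% on a flat toy base tariff of 300.
oracle = PerturbedOracle(
    base_quote_fn=lambda p: 300.0,
    mask_fn=lambda p: p["driver_age"] < 30,
    factor=1.20,
)
```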
Simulations and perturbation types are fully defined in YAML configuration files.
The perturbation library (tariff_changes.yaml) holds named definitions
— young-driver surcharge, high-value surcharge, uniform reprice, area repricing,
and composed stacked shocks — which are referenced by name from each simulation's
schedule in simulation.yaml. A schedule entry can list multiple perturbation
names to apply them simultaneously at the same week (e.g. high-value surcharge and
young-driver surcharge both at week 4), or spread across different weeks for sequential
multi-wave shocks. Adding a new scenario requires no code changes.
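A hypothetical schedule entry illustrating the pattern described above (key names and structure are invented for illustration and do not reflect the actual schema; perturbation names refer to entries defined in tariff_changes.yaml):

```yaml
# Illustrative sketch only, not the project's real config format.
simulations:
  two_wave_shock:
    weeks: 10
    schedule:
      - week: 3
        perturbations: [young_driver_surcharge]   # e.g. +20%
      - week: 7
        perturbations: [area_repricing]
```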
This lets practitioners answer a critical operational question: is the weekly continuous scraping rate sufficient to track a tariff change, or does the model need a full restart with a fresh bulk scrape?
Oracle — validation results
| Metric | Value | Interpretation |
|---|---|---|
| In-sample RMSE | 64.10 | Root-mean-square error of ~€64 on premiums averaging ~€316 |
| In-sample R² | 0.793 | The oracle explains ~79% of premium variance in-sample |
| Theoretical R² ceiling | ~0.90 | ~10% of variance is irreducible within-policy noise: the same policy repriced across years differs through unobservable factors |
SHAP validation — key findings
Claim history (N_claims_history, R_Claims_history) is excluded from the oracle. In practice, scraping is performed with claim history set to 0 on aggregators, so this matches real-world scraping behaviour — but it means bonus-malus effects are not captured.
Active learning results
Simulation run: 10 weeks · 5 000 profiles/week · 11 strategies (5 CP variants, 5 Gaussian variants, random market). Each week, 570 candidate anchors are scored, the top 10% (57) are profiled, and profiles are sampled to the weekly budget of 5 000.
Global convergence
The error_based strategy provides a clear signal to concentrate budget. The effect is commercially relevant — young-driver pricing is one of the most sensitive and frequently debated segments in motor insurance.
Why sophisticated CP strategies underperform globally
Greedy informativeness strategies concentrate scraping budget on high-signal edge cases — young drivers, high-powered vehicles, extreme vehicle values — at the expense of mainstream segments. Random sampling, by contrast, draws anchors proportional to the real data distribution, which naturally matches a population-representative holdout.
Tariff change: restart is not always optimal
After a targeted tariff change (e.g. young-driver surcharge +20%), a full restart discards all accumulated labels — including valid ones from unchanged segments. The continuous scraping strategy retains those labels and can achieve lower global RMSE at week 10 than a restart strategy, even though its labels are partially stale.
Gaussian perturbations vs. ceteris-paribus profiles
A second research axis tests whether varying all features simultaneously — rather than one at a time — produces training data that LightGBM can learn from more efficiently. Gaussian profiles keep each anchor's batch near its natural feature context (via anchor-centred noise) while exposing the model to genuine joint-feature variation, which CP sweeps systematically suppress. Results for this comparison are pending a full re-run with the corrected anchor pool sizing.
Explore the project
| Resource | Description |
|---|---|
| GitHub repository | Full source: oracle, AL loop, Streamlit dashboard |
| Lledó & Pavía (2024) | Dataset of an actual motor vehicle insurance portfolio, Mendeley Data V2 |