6. Synthetic Datasets
make_quantification is the quantification analogue of
sklearn.datasets.make_classification. Where a classification generator
hands you one labelled table, a quantification generator must hand you many
bags — test samples whose class proportions vary — together with each bag’s
true prevalence, because that vector (not the individual labels) is what a
quantifier is scored against.
6.1. How it works
The generator works in two stages:
1. Build one labelled population. Internally it calls make_classification once to create a pool of n_samples points, whose difficulty you control with class_sep (cluster separation), flip_y (label noise), n_features and weights.
2. Draw n_batches bags from that population, where each bag’s class balance, feature positions, or labelling are perturbed according to the chosen shift_type.
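The two stages can be sketched with the standard library alone. This is a toy illustration under stated assumptions, not mlquantify's implementation: `build_population` and `draw_bag` are hypothetical helpers, a pair of 1-D Gaussian clusters stands in for make_classification, and bags are assumed to be drawn with replacement.

```python
import random

def build_population(n_samples, class_sep=1.0, seed=0):
    """Stage 1: one labelled pool (toy stand-in for make_classification)."""
    rng = random.Random(seed)
    population = []
    for _ in range(n_samples):
        y = rng.randint(0, 1)                        # roughly balanced binary labels
        centre = class_sep if y == 1 else -class_sep
        population.append((rng.gauss(centre, 1.0), y))
    return population

def draw_bag(population, prevalence_pos, batch_size, rng):
    """Stage 2: resample the pool so the bag has the requested class balance."""
    pos = [item for item in population if item[1] == 1]
    neg = [item for item in population if item[1] == 0]
    n_pos = round(prevalence_pos * batch_size)
    # Sampling with replacement is an assumption of this sketch.
    bag = rng.choices(pos, k=n_pos) + rng.choices(neg, k=batch_size - n_pos)
    rng.shuffle(bag)
    return bag

population = build_population(n_samples=2000)
rng = random.Random(1)
bags = [draw_bag(population, p, batch_size=100, rng=rng) for p in (0.1, 0.5, 0.9)]

# The per-bag target that a quantifier would be scored against.
true_prevalences = [sum(y for _, y in bag) / len(bag) for bag in bags]
```

Every bag comes from the same pool, so only the class balance differs between them; that is the prior-shift case described in the next subsection.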
The call returns three aligned objects:
from mlquantify.datasets import make_quantification
Xs, ys, prevalences = make_quantification(n_batches=10, random_state=0)
# Xs : list of feature matrices, one per bag
# ys : list of label vectors, one per bag
# prevalences : (n_batches, n_classes) array of TRUE prevalences
prevalences[i] is the quantification target for bag i — feed Xs[i] to
a fitted quantifier and compare its prediction against prevalences[i].
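As a rough illustration of that comparison, the sketch below scores one bag with the mean absolute error between prevalence vectors, which is one common choice among several. `prevalence_of` and `absolute_error` are hypothetical helpers written for this example, and `pred_prev` is a made-up value standing in for a quantifier's output.

```python
def prevalence_of(labels, n_classes):
    """Fraction of each class in a label vector."""
    counts = [0] * n_classes
    for y in labels:
        counts[y] += 1
    return [c / len(labels) for c in counts]

def absolute_error(true_prev, pred_prev):
    """Mean absolute difference between two prevalence vectors."""
    diffs = [abs(t - p) for t, p in zip(true_prev, pred_prev)]
    return sum(diffs) / len(diffs)

labels = [0, 0, 0, 1]                             # labels of one toy bag
true_prev = prevalence_of(labels, n_classes=2)    # [0.75, 0.25]
pred_prev = [0.5, 0.5]                            # hypothetical quantifier output
error = absolute_error(true_prev, pred_prev)      # 0.25
```

Averaging this error over all bags gives a single score for the quantifier across the whole range of prevalences.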
6.2. Controlling the prevalence (prior shift)
By default shift_type="prior": the class prevalence \(P(y)\) changes from
bag to bag while the class-conditional features \(P(x \mid y)\) stay fixed.
The prevalence argument decides how each bag’s target prevalence is drawn:
| prevalence | Behaviour |
|---|---|
| "uniform" | Prevalences spread uniformly over the probability simplex, covering the full range of shifts (the default). |
| | A regular grid over the simplex (the Artificial Prevalence Protocol); the bag count is then set by |
| | Bags drawn i.i.d. from the population, so prevalence fluctuates only with sampling noise around |
| "dirichlet" | From a Dirichlet centred on |
# bags concentrated near a 70/30 split
Xs, ys, prevs = make_quantification(
n_batches=30, prevalence="dirichlet",
target_prevalence=[0.7, 0.3], concentration=150, random_state=0)
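To make the difference between a uniform and a Dirichlet draw concrete, here is a minimal standard-library sketch. It does not reproduce the library's sampler: the helper names are hypothetical, and the parametrisation `alpha = concentration * target_prevalence` is an assumption made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one point from Dirichlet(alpha) by normalising Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def uniform_prevalence(n_classes, rng):
    """Uniform over the simplex is the special case Dirichlet(1, ..., 1)."""
    return sample_dirichlet([1.0] * n_classes, rng)

def dirichlet_prevalence(target_prevalence, concentration, rng):
    """ASSUMED parametrisation: alpha = concentration * target_prevalence."""
    alpha = [concentration * p for p in target_prevalence]
    return sample_dirichlet(alpha, rng)

def std(values):
    mean = sum(values) / len(values)
    return (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5

rng = random.Random(0)
spread = [uniform_prevalence(2, rng) for _ in range(1000)]
tight = [dirichlet_prevalence([0.7, 0.3], 150, rng) for _ in range(1000)]

mean_tight = sum(p[0] for p in tight) / len(tight)   # close to 0.7
std_spread = std([p[0] for p in spread])             # wide: covers the whole range
std_tight = std([p[0] for p in tight])               # narrow: clustered near 0.7
```

Under this assumed parametrisation, a larger concentration pulls the bags closer to the target prevalence, while a smaller one lets them wander further from it.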
6.3. Shift types
Beyond prior shift, shift_type can request the other two canonical kinds of
dataset shift. Each is realised by a different per-bag operation on the same
fixed population:
| shift_type | What changes | How a bag is built |
|---|---|---|
| "prior" | \(P(y)\): the class balance | Resample the population to a target prevalence; the labels are the population’s own. |
| "covariate" | \(P(x)\): where the features sit | Translate the bag’s features by a random vector, then label them with a fixed reference boundary. The boundary stays put while the cloud slides across it ( |
| "concept" | \(P(y \mid x)\): the labelling rule | Keep the features in place but rotate the reference boundary and relabel, so the same point can change class ( |
Both covariate and concept shift use a single reference decision boundary — a logistic-regression rule fit on the population. Covariate shift moves the features across that boundary; concept shift rotates the boundary itself.
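The contrast between the two operations can be sketched in two dimensions. This example is an illustration only: the helper names are hypothetical, and the reference boundary is fixed by hand here rather than fit by logistic regression as the generator does.

```python
import math

def label(points, coef, intercept):
    """Binary label from the sign of coef . x + intercept."""
    return [int(coef[0] * x0 + coef[1] * x1 + intercept > 0) for x0, x1 in points]

def translate(points, shift):
    """Covariate shift: slide the whole cloud, boundary untouched."""
    return [(x0 + shift[0], x1 + shift[1]) for x0, x1 in points]

def rotate(coef, angle):
    """Concept shift: rotate the boundary's normal vector, points untouched."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * coef[0] - s * coef[1], s * coef[0] + c * coef[1])

points = [(-1.0, 0.5), (-0.5, -1.0), (0.5, 1.0), (1.0, -0.5)]
coef, intercept = (1.0, 0.0), 0.0     # hand-picked reference rule: class 1 when x0 > 0

reference_labels = label(points, coef, intercept)
# [0, 0, 1, 1]
covariate_labels = label(translate(points, (0.75, 0.0)), coef, intercept)
# [0, 1, 1, 1]: the second point slid across the fixed boundary
concept_labels = label(points, rotate(coef, math.pi / 2), intercept)
# [1, 0, 1, 0]: same points, but the rule now splits on x1 instead of x0
```

In the covariate case the labelling rule is identical before and after; only the positions change. In the concept case the positions are identical and the rule itself changes.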
6.4. The decision boundary
Because that boundary is explicit, return_boundary=True hands it back, so the
exact labelling rule is known rather than hidden:
Xs, ys, prevs, boundary = make_quantification(
n_batches=5, shift_type="covariate", return_boundary=True, random_state=0)
boundary is a DecisionBoundary namedtuple with per-bag coef and
intercept arrays. For a binary problem coef has shape
(n_bags, n_features) and intercept (n_bags,) — the hyperplane
coef[i] · x + intercept[i] = 0. Covariate (and prior) bags share one fixed
boundary, so all rows are identical; concept bags each carry their own
rotated boundary.
For k > 2 classes the shapes become (n_bags, k, n_features) and
(n_bags, k) — one weight vector per class — and the label is
argmax(coef[i] · x + intercept[i]), which carves the feature space into k
piecewise-linear regions rather than splitting it with a single line.
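The argmax rule for one bag can be written out in a few lines. The sketch below uses plain lists and a made-up boundary for a single bag with k = 3 classes and 2 features; `decision_scores` and `predict_multiclass` are hypothetical helpers, not part of the library.

```python
def decision_scores(coef, intercept, x):
    """One score per class: coef[k] . x + intercept[k]."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(coef, intercept)]

def predict_multiclass(coef, intercept, x):
    """Label is the index of the largest class score (argmax)."""
    scores = decision_scores(coef, intercept, x)
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical values playing the role of boundary.coef[i] (shape (3, 2))
# and boundary.intercept[i] (shape (3,)) for one bag i.
coef_i = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
intercept_i = [0.0, 0.0, -0.5]

labels = [predict_multiclass(coef_i, intercept_i, x)
          for x in [(2.0, 0.0), (-2.0, 0.0), (0.0, 3.0)]]
# [0, 1, 2]: each point falls in the region where its class score is largest
```

With a real return value, the same two functions would be applied to the slices `boundary.coef[i]` and `boundary.intercept[i]` for the bag of interest.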
6.5. Stacking shifts
Real data rarely shifts in one way only. Pass a list to shift_type to
combine them:
Xs, ys, prevs = make_quantification(
n_batches=50, shift_type=["prior", "covariate", "concept"],
prevalence="uniform", covariate_scale=0.4, concept_strength=0.3,
random_state=0)
Stacked shifts compose per bag in a fixed order: covariate translates the features, concept rotates the boundary, the bag is relabelled by that boundary, and prior finally resamples it to the target prevalence — so every bag can differ in position, labelling rule and class balance at once.
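The order of composition can be sketched for a binary, two-feature bag. This is a simplified illustration, not the library's code: `make_stacked_bag` is a hypothetical helper, the `shift` and `angle` arguments are stand-ins that are not claimed to match covariate_scale or concept_strength, and resampling with replacement is an assumption.

```python
import math
import random

def make_stacked_bag(points, coef, intercept, shift, angle, target_pos, size, rng):
    """Compose covariate, concept and prior shift for one binary 2-D bag."""
    # 1. Covariate: translate every feature vector by the same offset.
    moved = [(x0 + shift[0], x1 + shift[1]) for x0, x1 in points]
    # 2. Concept: rotate the boundary's normal vector by `angle` radians.
    c, s = math.cos(angle), math.sin(angle)
    w = (c * coef[0] - s * coef[1], s * coef[0] + c * coef[1])
    # 3. Relabel the translated points with the rotated boundary.
    labelled = [(x, int(w[0] * x[0] + w[1] * x[1] + intercept > 0)) for x in moved]
    # 4. Prior: resample (with replacement) to the target class balance.
    pos = [item for item in labelled if item[1] == 1]
    neg = [item for item in labelled if item[1] == 0]
    n_pos = round(target_pos * size)
    bag = rng.choices(pos, k=n_pos) + rng.choices(neg, k=size - n_pos)
    rng.shuffle(bag)
    return bag

rng = random.Random(0)
population = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(500)]
bag = make_stacked_bag(population, coef=(1.0, 0.0), intercept=0.0,
                       shift=(0.4, 0.0), angle=0.3, target_pos=0.2,
                       size=50, rng=rng)
bag_prevalence = sum(y for _, y in bag) / len(bag)   # matches target_pos
```

Because the prior step runs last, the bag's final prevalence matches the requested target regardless of how the earlier translation and rotation changed the labels.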
6.6. Training data and difficulty
Set return_train=True to also receive a clean training sample (drawn from a
disjoint half of the population at train_prevalence), so you can fit a
quantifier and score it on the shifted bags in a single call:
X_train, y_train, Xs, ys, prevs = make_quantification(
n_batches=100, return_train=True, train_prevalence=[0.5, 0.5],
random_state=0)
How hard the bags are to quantify is governed by the underlying problem:
class_sep (lower is harder), flip_y (more label noise), n_features
(more noise dimensions) and batch_size (smaller bags are noisier).
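The effect of bag size can be checked with a small simulation. For bags drawn i.i.d. from a binary population with positive prevalence p, the sample prevalence has standard deviation sqrt(p(1 - p) / n) for a bag of n points. The sketch below, using a hypothetical helper `prevalence_spread`, estimates that spread empirically for two bag sizes.

```python
import math
import random

def prevalence_spread(batch_size, p=0.3, n_bags=2000, seed=0):
    """Empirical std of the positive-class prevalence across i.i.d. bags."""
    rng = random.Random(seed)
    prevs = [sum(rng.random() < p for _ in range(batch_size)) / batch_size
             for _ in range(n_bags)]
    mean = sum(prevs) / n_bags
    return math.sqrt(sum((v - mean) ** 2 for v in prevs) / n_bags)

small = prevalence_spread(batch_size=25)    # theory: sqrt(0.3 * 0.7 / 25)  ~ 0.092
large = prevalence_spread(batch_size=400)   # theory: sqrt(0.3 * 0.7 / 400) ~ 0.023
```

Going from 25 to 400 points per bag cuts the sampling noise by a factor of about four, so errors measured on small bags mix quantifier error with this irreducible fluctuation.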
See also
Visualizing synthetic quantification data and the surrounding gallery section visualise every option above with plots.
make_quantification: the full parameter reference.