6. Synthetic Datasets

make_quantification is the quantification analogue of sklearn.datasets.make_classification. Where a classification generator hands you one labelled table, a quantification generator must hand you many bags — test samples whose class proportions vary — together with each bag’s true prevalence, because that vector (not the individual labels) is what a quantifier is scored against.

6.1. How it works

The generator works in two stages:

  1. Build one labelled population. Internally it calls make_classification once to create a pool of n_samples points, whose difficulty you control with class_sep (cluster separation), flip_y (label noise), n_features and weights.

  2. Draw n_batches bags from that population, where each bag’s class balance, feature positions, or labelling is perturbed according to the chosen shift_type.
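
The two stages can be illustrated with a minimal NumPy sketch. It is not the library's implementation: two Gaussian blobs stand in for the make_classification pool, and draw_bag is a hypothetical helper that resamples the pool to a requested prevalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: one labelled population. Two Gaussian blobs stand in for the
# make_classification pool (an assumption made for this illustration).
n_samples = 2000
y_pool = rng.integers(0, 2, size=n_samples)
X_pool = rng.normal(size=(n_samples, 2)) + 2.0 * y_pool[:, None]


def draw_bag(X, y, prevalence, batch_size, rng):
    """Resample (X, y) so the bag's class proportions match `prevalence`."""
    counts = np.round(np.asarray(prevalence) * batch_size).astype(int)
    counts[-1] = batch_size - counts[:-1].sum()  # force counts to sum to batch_size
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=True)
        for c, n in enumerate(counts)
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]


# Stage 2: draw bags whose target prevalence varies from bag to bag.
targets = rng.dirichlet([1.0, 1.0], size=5)
bags = [draw_bag(X_pool, y_pool, p, 100, rng) for p in targets]

# The realised prevalence of each bag is the quantification target.
true_prev = np.array([np.bincount(yb, minlength=2) / len(yb) for _, yb in bags])
```

Each realised prevalence differs from its target only by the rounding needed to get whole sample counts.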

The call returns three aligned objects:

from mlquantify.datasets import make_quantification

Xs, ys, prevalences = make_quantification(n_batches=10, random_state=0)
#   Xs           : list of feature matrices, one per bag
#   ys           : list of label vectors,    one per bag
#   prevalences  : (n_batches, n_classes) array of TRUE prevalences

prevalences[i] is the quantification target for bag i — feed Xs[i] to a fitted quantifier and compare its prediction against prevalences[i].
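
As a sketch of that comparison, the snippet below scores a set of estimates with the mean absolute error between true and predicted prevalence vectors. The numbers are made up for illustration, and mean_absolute_error is a local helper defined here, not a function taken from the library.

```python
import numpy as np


def mean_absolute_error(true_prev, pred_prev):
    """Average absolute prevalence error over all bags and classes."""
    true_prev = np.asarray(true_prev, dtype=float)
    pred_prev = np.asarray(pred_prev, dtype=float)
    return float(np.abs(true_prev - pred_prev).mean())


# Hypothetical values: true prevalences of three binary bags and the
# estimates some quantifier produced for them.
prevalences = np.array([[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]])
estimates = np.array([[0.3, 0.7], [0.5, 0.5], [0.8, 0.2]])

mae = mean_absolute_error(prevalences, estimates)
```

In practice estimates[i] would come from running a fitted quantifier on Xs[i].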

6.2. Controlling the prevalence (prior shift)

By default shift_type="prior": the class prevalence \(P(y)\) changes from bag to bag while the class-conditional features \(P(x \mid y)\) stay fixed. The prevalence argument decides how each bag’s target prevalence is drawn:

| prevalence | Behaviour |
|---|---|
| "uniform" | Prevalences spread uniformly over the probability simplex, covering the full range of shifts (the default). |
| "grid" | A regular grid over the simplex (the Artificial Prevalence Protocol); the number of bags is then determined by n_prevalences and repeats. |
| "natural" | Bags drawn i.i.d. from the population, so prevalence fluctuates only with sampling noise around weights. |
| "dirichlet" | Drawn from a Dirichlet centred on target_prevalence with spread set by concentration: high concentration pins bags to the target, low concentration spreads them out. |

# bags concentrated near a 70/30 split
Xs, ys, prevs = make_quantification(
    n_batches=30, prevalence="dirichlet",
    target_prevalence=[0.7, 0.3], concentration=150, random_state=0)
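
To make the sampling strategies concrete, here is a NumPy sketch of three of them for a three-class problem. It rests on assumptions that the text does not confirm: the grid is built from m evenly spaced values per class (m is a hypothetical stand-in for n_prevalences), and the Dirichlet parameters are taken to be the concentration multiplied by the target vector.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_classes = 3

# "uniform": a flat Dirichlet is the uniform distribution on the simplex.
uniform = rng.dirichlet(np.ones(n_classes), size=4)

# "grid": every vector with entries in {0, 1/(m-1), ..., 1} that sums to 1.
m = 5  # grid values per class (hypothetical stand-in for n_prevalences)
grid = np.array(
    [p for p in product(range(m), repeat=n_classes) if sum(p) == m - 1],
    dtype=float,
) / (m - 1)

# "dirichlet": centred on a target; assumed parametrisation
# alpha = concentration * target, so larger values give tighter bags.
target = np.array([0.6, 0.3, 0.1])
tight = rng.dirichlet(150 * target, size=1000)
loose = rng.dirichlet(2 * target, size=1000)

spread_tight = tight.std(axis=0)
spread_loose = loose.std(axis=0)
```

Under this parametrisation the mean of the draws stays at the target whatever the concentration; only the spread around it changes.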

6.3. Shift types

Beyond prior shift, shift_type can request the other two canonical kinds of dataset shift. Each is realised by a different per-bag operation on the same fixed population:

| shift_type | What changes | How a bag is built |
|---|---|---|
| "prior" | \(P(y)\), the class balance | Resample the population to a target prevalence; the labels are the population’s own. |
| "covariate" | \(P(x)\), where the features sit | Translate the bag’s features by a random vector, then label them with a fixed reference boundary. The boundary stays put while the cloud slides across it (covariate_scale sets the distance). |
| "concept" | \(P(y \mid x)\), the labelling rule | Keep the features in place but rotate the reference boundary and relabel, so the same point can change class (concept_strength sets the rotation). |

Both covariate and concept shift use a single reference decision boundary — a logistic-regression rule fit on the population. Covariate shift moves the features across that boundary; concept shift rotates the boundary itself.
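
The difference between the two operations can be sketched in two dimensions. This is an illustration only: the boundary below is hand-picked rather than fitted by logistic regression, and the shift vector and rotation angle are arbitrary values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))

# Reference boundary w . x + b = 0 (hand-picked for this sketch).
w, b = np.array([1.0, 0.0]), 0.0


def label(X, w, b):
    """Assign class 1 to points on the positive side of the hyperplane."""
    return (X @ w + b > 0).astype(int)


y_ref = label(X, w, b)

# Covariate shift: slide the cloud, keep the boundary where it is.
shift = np.array([1.5, 0.0])
y_cov = label(X + shift, w, b)

# Concept shift: keep the cloud, rotate the boundary by theta and relabel.
theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
y_con = label(X, R @ w, b)

share_cov = y_cov.mean()               # class balance after the translation
flipped = (y_con != y_ref).mean()      # fraction of points that changed class
```

Translating the cloud changes how many points fall on each side of a rule that itself did not move, whereas rotating the rule changes the class of points that stayed exactly where they were.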

6.4. The decision boundary

Because that boundary is explicit, return_boundary=True hands it back, so the exact labelling rule is known rather than hidden:

Xs, ys, prevs, boundary = make_quantification(
    n_batches=5, shift_type="covariate", return_boundary=True, random_state=0)

boundary is a DecisionBoundary namedtuple with per-bag coef and intercept arrays. For a binary problem coef has shape (n_bags, n_features) and intercept (n_bags,) — the hyperplane coef[i] · x + intercept[i] = 0. Covariate (and prior) bags share one fixed boundary, so all rows are identical; concept bags each carry their own rotated boundary.

For k > 2 classes the shapes become (n_bags, k, n_features) and (n_bags, k), with one weight vector per class, and the label is argmax(coef[i] · x + intercept[i]). This partitions the feature space into k regions with piecewise-linear borders rather than splitting it with a single hyperplane.
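
The two labelling rules can be written out directly. The sketch below applies them to hand-made coefficients rather than to a boundary returned by the library, and it assumes that the positive side of the binary hyperplane maps to class 1, which the text does not state.

```python
import numpy as np


def predict_binary(X, coef_i, intercept_i):
    """Label by the side of the hyperplane coef_i . x + intercept_i = 0.

    Mapping the positive side to class 1 is an assumption of this sketch.
    """
    return (X @ coef_i + intercept_i > 0).astype(int)


def predict_multiclass(X, coef_i, intercept_i):
    """coef_i: (k, n_features), intercept_i: (k,). Label = argmax of the k scores."""
    return np.argmax(X @ coef_i.T + intercept_i, axis=1)


X = np.array([[2.0, 0.0], [-1.0, 0.5], [0.0, -3.0]])

# Binary case: a single weight vector for the bag.
y2 = predict_binary(X, np.array([1.0, 1.0]), 0.0)

# Three-class case: one weight vector per class.
coef3 = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, -1.0]])
y3 = predict_multiclass(X, coef3, np.zeros(3))
```

With a boundary from return_boundary=True, coef_i and intercept_i would be boundary.coef[i] and boundary.intercept[i] for bag i.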

6.5. Stacking shifts

Real data rarely shifts in one way only. Pass a list to shift_type to combine them:

Xs, ys, prevs = make_quantification(
    n_batches=50, shift_type=["prior", "covariate", "concept"],
    prevalence="uniform", covariate_scale=0.4, concept_strength=0.3,
    random_state=0)

Stacked shifts compose per bag in a fixed order: covariate translates the features, concept rotates the boundary, the bag is relabelled by that boundary, and prior finally resamples it to the target prevalence — so every bag can differ in position, labelling rule and class balance at once.
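
That ordering can be sketched for a binary problem with two features. Everything below is illustrative: stacked_bag and rotate are hypothetical helpers, and the way covariate_scale scales the translation and concept_strength scales the rotation angle is an assumption, not the library's documented formula.

```python
import numpy as np


def rotate(w, theta):
    """Rotate a 2-D weight vector by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]]) @ w


def stacked_bag(X_pool, w, b, target_pos, batch_size, rng,
                covariate_scale=0.4, concept_strength=0.3):
    # 1. Covariate: translate the features by a random vector.
    X = X_pool + covariate_scale * rng.normal(size=X_pool.shape[1])
    # 2. Concept: rotate the reference boundary.
    w_bag = rotate(w, concept_strength * rng.uniform(-np.pi, np.pi))
    # 3. Relabel the translated points with the rotated boundary.
    y = (X @ w_bag + b > 0).astype(int)
    # 4. Prior: resample to the target share of positives.
    n_pos = int(round(target_pos * batch_size))
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    idx = np.concatenate([rng.choice(pos, n_pos),
                          rng.choice(neg, batch_size - n_pos)])
    return X[idx], y[idx]


rng = np.random.default_rng(0)
X_pool = rng.normal(size=(2000, 2))
Xb, yb = stacked_bag(X_pool, np.array([1.0, 0.0]), 0.0, 0.25, 200, rng)
```

Because the prior step runs last, the bag's class balance matches the target regardless of how the earlier steps moved the features or the boundary.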

6.6. Training data and difficulty

Set return_train=True to also receive a clean training sample (drawn from a disjoint half of the population at train_prevalence), so you can fit a quantifier and score it on the shifted bags in a single call:

X_train, y_train, Xs, ys, prevs = make_quantification(
    n_batches=100, return_train=True, train_prevalence=[0.5, 0.5],
    random_state=0)

How hard the bags are to quantify is governed by the underlying problem: class_sep (lower is harder), flip_y (more label noise), n_features (more noise dimensions) and batch_size (smaller bags are noisier).
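
The effect of batch_size can be quantified under a simple assumption: if a bag's points are drawn independently and a class has population share p, the observed share in a bag of size n has standard error sqrt(p(1 - p) / n). The helper below is a hypothetical illustration of that formula, not part of the library.

```python
import numpy as np


def prevalence_std(p, batch_size):
    """Standard error of a class's observed share in an i.i.d. bag."""
    return np.sqrt(p * (1.0 - p) / batch_size)


# Sampling noise for a class with a 30% share at three bag sizes.
noise = {n: float(prevalence_std(0.3, n)) for n in (50, 500, 5000)}
```

A tenfold increase in bag size shrinks this noise by a factor of about 3.16, the square root of ten.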

See also