make_quantification

mlquantify.datasets.make_quantification(n_batches=10, batch_size=500, *, n_samples=10000, n_classes=2, n_features=20, n_informative=None, n_redundant=2, n_clusters_per_class=1, class_sep=1.0, flip_y=0.01, weights=None, shift_type='prior', prevalence='uniform', target_prevalence=None, concentration=None, min_prev=0.0, max_prev=1.0, n_prevalences=None, repeats=1, covariate_scale=None, concept_strength=None, return_train=False, train_size=None, train_prevalence=None, return_prevalences=True, return_boundary=False, stack=False, pack='lists', as_frame=False, shuffle=True, random_state=None)

Generate synthetic quantification bags under dataset shift (prior-probability shift by default).

The quantification analogue of sklearn.datasets.make_classification: it builds one labelled population, then draws n_batches bags from it, where each bag’s class prevalence is sampled according to a shift strategy. Because the quantification target is a bag’s class distribution (not its per-instance labels), the true prevalence of every bag is returned alongside the data.

Three kinds of dataset shift are supported through shift_type:

  • prior — \(P(y)\) changes (bags are resampled to a target prevalence); the clusters keep their position.

  • covariate — the position of the features changes while the decision boundary stays fixed: each bag’s feature cloud is translated, then labelled by the same fixed boundary, so a class appears in new regions of space.

  • concept — the decision boundary moves: points stay where they are and are relabelled by a per-bag rotation of a reference boundary.

In every case the returned prevalences are the achieved class proportions of each bag.
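As an illustration of the prior case, the sketch below resamples one bag from a toy labelled population so that it hits a target prevalence, then reports the achieved proportions. It uses only NumPy; `draw_bag` and the toy population are hypothetical and are not the library's implementation, which may round counts or sample differently.

```python
import numpy as np

def draw_bag(X, y, prevalence, size, rng):
    """Resample `size` instances whose class proportions follow `prevalence`."""
    prevalence = np.asarray(prevalence, dtype=float)
    # Turn the target proportions into integer per-class counts summing to `size`.
    counts = np.floor(prevalence * size).astype(int)
    counts[np.argmax(prevalence)] += size - counts.sum()
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=True)
        for c, n in enumerate(counts)
    ])
    rng.shuffle(idx)
    # The achieved prevalence is measured on the bag itself.
    achieved = np.bincount(y[idx], minlength=len(prevalence)) / size
    return X[idx], y[idx], achieved

rng = np.random.default_rng(0)
# Toy population: two Gaussian clusters, one per class.
y_pop = rng.integers(0, 2, size=2000)
X_pop = rng.normal(loc=y_pop[:, None] * 2.0, scale=1.0, size=(2000, 2))

X_bag, y_bag, achieved = draw_bag(X_pop, y_pop, [0.8, 0.2], size=500, rng=rng)
print(achieved)
```

Only the class mix changes here: every instance still comes from the same two clusters, which is what distinguishes prior shift from the covariate and concept cases.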

Parameters:
n_batches : int, default=10

Number of bags to draw. Ignored when prevalence='grid' (the grid density then sets the count).

batch_size : int, tuple (low, high), or sequence of int, default=500

Size of each bag. An int gives equal sizes; a (low, high) tuple draws a random size per bag; a sequence sets each size explicitly.
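The three accepted forms can be pictured as expanding to one size per bag. The helper below is a hypothetical sketch, not the library's code. It assumes that a 2-tuple means a range, that a list means explicit sizes, and that both ends of the range are inclusive; none of these details are stated above.

```python
import numpy as np

def resolve_batch_sizes(batch_size, n_batches, rng):
    """Hypothetical helper: expand `batch_size` into one size per bag."""
    if isinstance(batch_size, int):
        return np.full(n_batches, batch_size)
    if isinstance(batch_size, tuple) and len(batch_size) == 2:
        low, high = batch_size
        # Assumption: both ends of the range are inclusive.
        return rng.integers(low, high, size=n_batches, endpoint=True)
    sizes = np.asarray(batch_size, dtype=int)
    if sizes.shape != (n_batches,):
        raise ValueError("an explicit sequence needs one size per bag")
    return sizes

rng = np.random.default_rng(0)
equal = resolve_batch_sizes(500, 4, rng)
random_sizes = resolve_batch_sizes((100, 300), 4, rng)
explicit = resolve_batch_sizes([200, 400, 600, 800], 4, rng)
```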

n_samples : int, default=10000

Size of the underlying labelled population the bags are drawn from.

n_classes : int, default=2

Number of classes.

n_features, n_informative, n_redundant, n_clusters_per_class : int

Passed to make_classification. When n_informative is None a value large enough to place the class clusters is chosen automatically.

class_sep : float, default=1.0

Class separability. Lower values make quantification harder (and the adjustment of ACC/EMQ/DyS more valuable).

flip_y : float, default=0.01

Fraction of labels randomly flipped in the population (label noise).

weights : array-like of shape (n_classes,), default=None

Class balance of the population (its prior \(P(y)\)).

shift_type : str or list of str, default='prior'

One of 'prior', 'covariate', 'concept' (see the summary above), or a list of them to stack for a more diverse dataset, e.g. ['prior', 'covariate']. When stacked they compose per bag: covariate translates the features, concept rotates the labelling boundary, and prior resamples to a target prevalence. prevalence applies to 'prior'; covariate_scale / concept_strength tune the others.

prevalence : {'uniform', 'grid', 'natural', 'dirichlet'} or array-like, default='uniform'

For shift_type='prior' — how each bag’s target prevalence is sampled:

  • 'uniform' — uniformly over the probability simplex (full range of shifts).

  • 'grid' — a regular grid over the simplex (the Artificial Prevalence Protocol); count set by n_prevalences / repeats.

  • 'natural' — bags drawn i.i.d. from the population, so prevalence fluctuates around weights with sampling noise only.

  • 'dirichlet' — from a Dirichlet centred on target_prevalence with spread controlled by concentration.

  • an array of shape (n_batches, n_classes) of explicit vectors.
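The 'uniform' and 'grid' strategies can be sketched with NumPy and `itertools`. Both helpers are hypothetical. The grid version assumes `n_prevalences` evenly spaced values per class, including 0 and 1; the text above does not say how the library maps `n_prevalences` to a spacing, so read this as one plausible interpretation.

```python
import itertools
import numpy as np

def grid_prevalences(n_classes, n_prevalences):
    """All prevalence vectors whose entries are multiples of 1/(n_prevalences - 1)."""
    steps = n_prevalences - 1
    # Enumerate the ways to split `steps` units among `n_classes` classes.
    grid = [
        combo + (steps - sum(combo),)
        for combo in itertools.product(range(steps + 1), repeat=n_classes - 1)
        if sum(combo) <= steps
    ]
    return np.array(grid) / steps

def uniform_prevalences(n_batches, n_classes, rng):
    """Uniform draws over the simplex, i.e. a flat Dirichlet(1, ..., 1)."""
    return rng.dirichlet(np.ones(n_classes), size=n_batches)

grid = grid_prevalences(n_classes=3, n_prevalences=5)
uniform = uniform_prevalences(10, 3, np.random.default_rng(0))
print(grid.shape, uniform.shape)
```

The grid size grows combinatorially with the number of classes, which is why the bag count under 'grid' comes from the grid density rather than from n_batches.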

target_prevalence : array-like of shape (n_classes,), default=None

Mean prevalence for prevalence='dirichlet' (defaults to balanced).

concentration : float, default=None

Dirichlet total concentration \(\kappa\). Larger values keep bags tightly around target_prevalence (low variability); smaller values spread them toward extreme shifts. Defaults to n_classes (which reproduces the uniform simplex when target_prevalence is balanced).
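One common parametrisation consistent with this description sets the Dirichlet parameters to \(\alpha = \kappa \cdot \pi\), where \(\pi\) is target_prevalence. Each component then has mean \(\pi_i\) and variance \(\pi_i (1 - \pi_i) / (\kappa + 1)\). With a balanced \(\pi\) and \(\kappa\) equal to n_classes, every \(\alpha_i\) is 1, which is the uniform simplex. The sketch below assumes that parametrisation. `dirichlet_prevalences` is a hypothetical name, and the library may sample differently.

```python
import numpy as np

def dirichlet_prevalences(target_prevalence, concentration, n_batches, rng):
    """Sample prevalences with mean `target_prevalence` and total concentration kappa."""
    alpha = concentration * np.asarray(target_prevalence, dtype=float)
    return rng.dirichlet(alpha, size=n_batches)

rng = np.random.default_rng(0)
target = [0.7, 0.3]
tight = dirichlet_prevalences(target, 150, 5000, rng)  # large kappa: low variability
loose = dirichlet_prevalences(target, 2, 5000, rng)    # small kappa: extreme shifts

print("kappa=150:", tight.mean(axis=0), tight.std(axis=0))
print("kappa=2:  ", loose.mean(axis=0), loose.std(axis=0))
```

Both runs share the same mean; only the spread around it changes with the concentration.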

min_prev, max_prev : float, default=0.0, 1.0

Per-class clipping bounds on the sampled prevalences.

n_prevalences, repeats : int, default=None, 1

Grid density and repetitions for prevalence='grid'.

covariate_scale : float, default=None

For shift_type='covariate' — magnitude of the per-bag feature translation (in feature-std units). 0 leaves the cloud in place; larger values move P(x) further. Defaults to 1.5.
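A toy version of this mechanism: sample instances, translate them along a random direction scaled by the feature standard deviations, and label the translated points with one fixed linear rule. The boundary, the choice of direction and `covariate_bag` are all assumptions made for illustration; the library's translation scheme is not described beyond the summary above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Fixed linear rule shared by every bag: label 1 when X @ coef + intercept > 0.
coef = np.array([1.0, -0.5])
intercept = 0.0
X_pop = rng.normal(size=(5000, 2))

def covariate_bag(X_pop, scale, size, rng):
    """Translate a random sample of the population, then label it with the fixed rule."""
    X = X_pop[rng.choice(len(X_pop), size=size, replace=False)]
    direction = rng.normal(size=X.shape[1])
    direction /= np.linalg.norm(direction)
    # Shift expressed in units of each feature's standard deviation.
    X_shifted = X + scale * X_pop.std(axis=0) * direction
    y = (X_shifted @ coef + intercept > 0).astype(int)
    return X_shifted, y

X0, y0 = covariate_bag(X_pop, scale=0.0, size=500, rng=rng)  # cloud stays in place
X1, y1 = covariate_bag(X_pop, scale=1.5, size=500, rng=rng)  # cloud moves
print(X0.mean(axis=0), y0.mean())
print(X1.mean(axis=0), y1.mean())
```

Because the rule is fixed while the cloud moves, the class proportions of the bag usually change as a side effect, which is why the achieved prevalence is still worth returning under covariate shift.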

concept_strength : float, default=None

For shift_type='concept' — how far the reference decision boundary is rotated per bag (radians-scale). 0 keeps the base boundary; larger values move it more. Defaults to 0.5.
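The idea can be sketched in two dimensions by rotating the normal vector of a reference boundary and relabelling the same points. `rotate_boundary` and the reference boundary are hypothetical, and the sketch rotates by a fixed angle, whereas how the library draws each bag's rotation from concept_strength is not stated.

```python
import numpy as np

def rotate_boundary(coef, angle):
    """Rotate a 2-D boundary normal by `angle` radians (counter-clockwise)."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]]) @ coef

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))      # the points stay where they are
base_coef = np.array([1.0, 0.0])   # reference boundary: the line x1 = 0

y_base = (X @ base_coef > 0).astype(int)
y_same = (X @ rotate_boundary(base_coef, 0.0) > 0).astype(int)
y_moved = (X @ rotate_boundary(base_coef, 0.5) > 0).astype(int)

# Fraction of instances whose label changed under the rotated rule.
changed = np.mean(y_base != y_moved)
print(changed)
```

For an isotropic cloud centred on the boundary, a rotation by \(\theta\) radians relabels roughly a fraction \(\theta / \pi\) of the points, so a strength of 0.5 changes about one label in six.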

return_train : bool, default=False

If True, also return a dedicated training sample drawn from a disjoint half of the population.

train_size : int, default=None

Size of the returned training sample (defaults to the whole training half).

train_prevalence : array-like of shape (n_classes,), default=None

Prevalence of the training sample (defaults to the natural population prior).
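The point of the disjoint half is that no instance used for training can reappear in a test bag. A minimal index-level sketch follows, assuming a random split; whether the library splits randomly or by position is not stated, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

perm = rng.permutation(n_samples)
train_idx = perm[: n_samples // 2]   # half reserved for the training sample
bag_idx = perm[n_samples // 2 :]     # half the bags are drawn from

# A smaller training sample, as train_size would request, stays inside its half.
train_size = 1_000
train_sample = rng.choice(train_idx, size=train_size, replace=False)
```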

return_prevalences : bool, default=True

If True, also return the (n_bags, n_classes) array of each bag’s true prevalence.

return_boundary : bool, default=False

If True, also return a DecisionBoundary namedtuple capturing the linear boundary used for each bag (coef and intercept stacked over bags). Covariate and prior bags share one fixed boundary; concept bags each carry their own rotated boundary — so the object records exactly how the rule moves, with no need to re-fit a classifier. For 2-D data, draw bag i from coef[i] / intercept[i].
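For 2-D binary data, a bag's line can be recovered by solving \(w_1 x_1 + w_2 x_2 + b = 0\) for \(x_2\). The sketch below uses a stand-in namedtuple with only the two fields named above, and assumes coef has shape (n_bags, 2) and intercept has shape (n_bags,). The library's own object may have more fields or a different layout for multiclass problems.

```python
from collections import namedtuple
import numpy as np

# Stand-in exposing only the two fields the text names.
DecisionBoundary = namedtuple("DecisionBoundary", ["coef", "intercept"])

def boundary_line(boundary, i, x1_range=(-3.0, 3.0)):
    """Endpoints of bag i's line w1*x1 + w2*x2 + b = 0 over `x1_range`."""
    w1, w2 = boundary.coef[i]
    b = boundary.intercept[i]
    x1 = np.array(x1_range)
    x2 = -(w1 * x1 + b) / w2  # assumes w2 != 0, i.e. the line is not vertical
    return x1, x2

boundary = DecisionBoundary(
    coef=np.array([[1.0, 2.0], [1.0, -1.0]]),
    intercept=np.array([0.5, 0.0]),
)
x1, x2 = boundary_line(boundary, 0)
print(x1, x2)
```

The two returned arrays can be passed directly to a plotting call such as matplotlib's `plot(x1, x2)` on top of a scatter of the bag.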

stack : bool, default=False

If True, stack the bags into (n_bags, batch_size, n_features) and (n_bags, batch_size) arrays. Requires equal bag sizes.
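The equal-size requirement follows from how arrays stack. The sketch below builds the same shapes by hand with `np.stack` on stand-in bags, which are random arrays rather than library output, and shows that ragged bags cannot be combined this way.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for three equally sized bags of 500 instances and 20 features.
Xs = [rng.normal(size=(500, 20)) for _ in range(3)]
ys = [rng.integers(0, 2, size=500) for _ in range(3)]

X_stacked = np.stack(Xs)   # shape (n_bags, batch_size, n_features)
y_stacked = np.stack(ys)   # shape (n_bags, batch_size)
print(X_stacked.shape, y_stacked.shape)

# Bags of different sizes have no common array shape.
ragged = [np.zeros((500, 20)), np.zeros((300, 20))]
try:
    np.stack(ragged)
except ValueError as exc:
    stack_error = str(exc)
```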

pack : {'lists', 'flat'}, default='lists'

'lists' returns Xs and ys as lists of per-bag arrays. 'flat' spreads them as (X1, ..., Xn, y1, ..., yn) (incompatible with return_train).
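A flat result can be split back into per-bag lists by position. The tuple below is a stand-in made of strings, not library output. It assumes the default return_prevalences=True, so the prevalences array follows the spread-out bags as the last item.

```python
# Stand-in for a pack='flat' result with three bags and return_prevalences=True.
flat = ("X1", "X2", "X3", "y1", "y2", "y3", "prevalences")

# Peel off the trailing prevalences, then halve what is left.
*bags, prevalences = flat
n_bags = len(bags) // 2
Xs, ys = bags[:n_bags], bags[n_bags:]
print(Xs, ys, prevalences)
```

If return_boundary is also set, one more trailing item has to be peeled off before halving.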

as_frame : bool, default=False

Return each bag’s X as a pandas DataFrame and y as a Series.

shuffle : bool, default=True

Shuffle instances within each bag.

random_state : int, Generator or None, default=None

Controls the population and the sampling.

Returns:
The return is a tuple assembled from the following, in order: the optional
training sample X_train, y_train (only if return_train), the bags
Xs, ys (lists, or stacked arrays, or spread out when pack='flat'),
the prevalences array (only if return_prevalences), and the fitted
decision boundary (only if return_boundary). With the defaults this
is (Xs, ys, prevalences).
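Because the tuple's length depends on the flags, a small wrapper that names the pieces can make calling code less fragile. `unpack` is a hypothetical helper, not part of the library. It follows the order listed above for pack='lists' and is exercised here on a stand-in tuple of strings.

```python
def unpack(result, *, return_train=False, return_prevalences=True,
           return_boundary=False):
    """Name the pieces of a result according to the flags used in the call."""
    items = iter(result)
    out = {}
    if return_train:
        out["X_train"], out["y_train"] = next(items), next(items)
    out["Xs"], out["ys"] = next(items), next(items)
    if return_prevalences:
        out["prevalences"] = next(items)
    if return_boundary:
        out["boundary"] = next(items)
    return out

# Stand-in for a call made with return_train=True and return_boundary=True.
result = ("X_train", "y_train", "Xs", "ys", "prevalences", "boundary")
named = unpack(result, return_train=True, return_boundary=True)
print(list(named))
```

The flags passed to the helper must match those passed to make_quantification; a mismatch silently mislabels the items.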

See also

mlquantify.model_selection.apply_protocol

Run a protocol over real data.

Examples

>>> from mlquantify.datasets import make_quantification
>>> Xs, ys, prevs = make_quantification(n_batches=3, random_state=0)
>>> len(Xs), len(ys), prevs.shape
(3, 3, (3, 2))
>>> # Bags concentrated near a 70/30 split:
>>> Xs, ys, prevs = make_quantification(   
...     n_batches=20, prevalence="dirichlet",
...     target_prevalence=[0.7, 0.3], concentration=150, random_state=0)