make_quantification
- mlquantify.datasets.make_quantification(n_batches=10, batch_size=500, *, n_samples=10000, n_classes=2, n_features=20, n_informative=None, n_redundant=2, n_clusters_per_class=1, class_sep=1.0, flip_y=0.01, weights=None, shift_type='prior', prevalence='uniform', target_prevalence=None, concentration=None, min_prev=0.0, max_prev=1.0, n_prevalences=None, repeats=1, covariate_scale=None, concept_strength=None, return_train=False, train_size=None, train_prevalence=None, return_prevalences=True, return_boundary=False, stack=False, pack='lists', as_frame=False, shuffle=True, random_state=None)
Generate synthetic quantification bags under dataset shift (prior-probability shift by default).
The quantification analogue of sklearn.datasets.make_classification: it builds one labelled population, then draws n_batches bags from it, where each bag’s class prevalence is sampled according to a shift strategy. Because the quantification target is a bag’s class distribution (not its per-instance labels), the true prevalence of every bag is returned alongside the data.
Three kinds of dataset shift are supported through shift_type:
- prior: \(P(y)\) changes (bags are resampled to a target prevalence); the clusters keep their position.
- covariate: the position of the features changes while the decision boundary stays fixed. Each bag’s feature cloud is translated, then labelled by the same fixed boundary, so a class appears in new regions of space.
- concept: the decision boundary moves. Points stay where they are and are relabelled by a per-bag rotation of a reference boundary.
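The covariate and concept mechanisms described above can be illustrated with a minimal 2-D sketch. This is not the library's implementation: the helpers `label` and `rotate`, the reference boundary, and the shift magnitudes are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical reference boundary in 2-D: class 1 where x @ coef + intercept > 0.
coef = np.array([1.0, 0.0])
intercept = 0.0


def label(X, coef, intercept):
    """Label points by the side of the linear boundary they fall on."""
    return (X @ coef + intercept > 0).astype(int)


def rotate(vec, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]]) @ vec


X = rng.normal(size=(500, 2))        # one feature cloud centred on the origin
y_ref = label(X, coef, intercept)    # labels under the reference boundary

# Covariate shift: translate the cloud, keep the boundary fixed.
X_cov = X + np.array([1.5, 0.0])
y_cov = label(X_cov, coef, intercept)

# Concept shift: keep the points, rotate the boundary, relabel.
coef_rot = rotate(coef, 0.5)
y_con = label(X, coef_rot, intercept)

print("reference prevalence of class 1:", y_ref.mean())
print("prevalence after translation:   ", y_cov.mean())
print("labels changed by the rotation: ", (y_con != y_ref).mean())
```

In both cases the class proportions of the bag change as a side effect, which is why the achieved prevalence has to be measured on the final labels rather than assumed.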
In every case the returned prevalences are the achieved class proportions of each bag.
- Parameters:
- n_batches : int, default=10
Number of bags to draw. Ignored when prevalence='grid' (the grid density then sets the count).
- batch_size : int, tuple (low, high), or sequence of int, default=500
Size of each bag. An int gives equal sizes; a (low, high) tuple draws a random size per bag; a sequence sets each size explicitly.
- n_samples : int, default=10000
Size of the underlying labelled population the bags are drawn from.
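Drawing a bag from the population at a chosen prevalence can be sketched as follows. This is an illustrative sketch under stated assumptions, not the library's code: `sample_bag` is a hypothetical helper, sampling with replacement is an assumption, and random labels stand in for the population.

```python
import numpy as np


def sample_bag(y, prevalence, batch_size, rng):
    """Return population indices of a bag whose class counts follow `prevalence`."""
    prevalence = np.asarray(prevalence, dtype=float)
    # Turn the target prevalence into integer per-class counts.
    counts = np.floor(prevalence * batch_size).astype(int)
    # Give the rounding remainder to the classes with the largest fractional parts.
    remainder = batch_size - counts.sum()
    order = np.argsort(-(prevalence * batch_size - counts))
    counts[order[:remainder]] += 1
    # Assumption: sample with replacement so a rare class can always be filled.
    idx = [
        rng.choice(np.flatnonzero(y == c), size=n, replace=True)
        for c, n in enumerate(counts)
    ]
    return rng.permutation(np.concatenate(idx))


rng = np.random.default_rng(0)
y_pop = rng.integers(0, 2, size=10_000)   # stand-in labels for the population
bag_idx = sample_bag(y_pop, [0.7, 0.3], batch_size=500, rng=rng)

# Achieved prevalence of the bag, measured on its labels.
achieved = np.bincount(y_pop[bag_idx], minlength=2) / len(bag_idx)
print(achieved)
```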
- n_classes : int, default=2
Number of classes.
- n_features, n_informative, n_redundant, n_clusters_per_class : int
Passed to make_classification. When n_informative is None, a value large enough to place the class clusters is chosen automatically.
- class_sep : float, default=1.0
Class separability. Lower values make quantification harder, and make the corrections applied by methods such as ACC, EMQ, and DyS more valuable.
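Why overlapping classes make an adjustment more valuable can be seen with a small worked example for binary ACC (Adjusted Classify and Count). The true and false positive rates below are made-up values standing in for a well-separated and a poorly separated problem; the function names are hypothetical.

```python
def classify_and_count(p_true, tpr, fpr):
    """Expected fraction predicted positive by an imperfect classifier."""
    return p_true * tpr + (1 - p_true) * fpr


def adjusted_count(cc, tpr, fpr):
    """ACC correction: invert the relation above and clip to [0, 1]."""
    return min(1.0, max(0.0, (cc - fpr) / (tpr - fpr)))


p_true = 0.2
# (tpr, fpr) pairs: well-separated classes first, overlapping classes second.
for tpr, fpr in [(0.95, 0.05), (0.8, 0.3)]:
    cc = classify_and_count(p_true, tpr, fpr)
    acc = adjusted_count(cc, tpr, fpr)
    print(f"tpr={tpr} fpr={fpr}  CC={cc:.3f}  ACC={acc:.3f}")
```

With the overlapping pair the raw count is far from the true prevalence of 0.2, while the adjusted estimate recovers it when the rates are known exactly.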
- flip_y : float, default=0.01
Fraction of labels randomly flipped in the population (label noise).
- weights : array-like of shape (n_classes,), default=None
Class balance of the population (its prior \(P(y)\)).
- shift_type : str or list of str, default='prior'
One of 'prior', 'covariate', 'concept' (see the summary above), or a list of them to stack for a more diverse dataset, e.g. ['prior', 'covariate']. When stacked they compose per bag: covariate translates the features, concept rotates the labelling boundary, and prior resamples to a target prevalence. prevalence applies to 'prior'; covariate_scale / concept_strength tune the others.
- prevalence : {'uniform', 'grid', 'natural', 'dirichlet'} or array-like, default='uniform'
For shift_type='prior', how each bag’s target prevalence is sampled:
  - 'uniform': uniformly over the probability simplex (full range of shifts).
  - 'grid': a regular grid over the simplex (the Artificial Prevalence Protocol); count set by n_prevalences / repeats.
  - 'natural': bags drawn i.i.d. from the population, so prevalence fluctuates around weights with sampling noise only.
  - 'dirichlet': from a Dirichlet centred on target_prevalence with spread controlled by concentration.
  - an array of shape (n_batches, n_classes) of explicit vectors.
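Two of the sampling strategies listed above, uniform and grid, can be sketched in a few lines. The helper names are hypothetical, and reading n_prevalences as the number of grid points per class axis is an assumption rather than something the documentation states.

```python
from itertools import product

import numpy as np


def uniform_simplex(n_bags, n_classes, rng):
    """Uniform draws over the simplex: a Dirichlet with every alpha equal to 1."""
    return rng.dirichlet(np.ones(n_classes), size=n_bags)


def simplex_grid(n_classes, n_prevalences):
    """Prevalence vectors on a regular grid (assumed: n_prevalences points per axis)."""
    steps = n_prevalences - 1
    # Keep only the integer combinations whose parts add up to the full mass.
    points = [
        combo
        for combo in product(range(steps + 1), repeat=n_classes)
        if sum(combo) == steps
    ]
    return np.array(points, dtype=float) / steps


rng = np.random.default_rng(0)
print(uniform_simplex(3, 2, rng))
print(simplex_grid(n_classes=2, n_prevalences=5))
```

The grid grows combinatorially with the number of classes, which is one reason random simplex sampling is often preferred beyond the binary case.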
- target_prevalence : array-like of shape (n_classes,), default=None
Mean prevalence for prevalence='dirichlet' (defaults to balanced).
- concentration : float, default=None
Dirichlet total concentration \(\kappa\). Larger values keep bags tightly around target_prevalence (low variability); smaller values spread them toward extreme shifts. Defaults to n_classes (which reproduces the uniform simplex when target_prevalence is balanced).
- min_prev, max_prev : float, default=0.0, 1.0
Per-class clipping bounds on the sampled prevalences.
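The effect of the concentration on the Dirichlet strategy can be sketched as below. The parametrisation alpha = concentration * target_prevalence is an assumption, chosen because it is consistent with the stated default (a balanced target with concentration equal to n_classes gives all alphas equal to 1, the uniform simplex). `dirichlet_prevalences` is a hypothetical helper, and clipping to min_prev / max_prev is not modelled.

```python
import numpy as np


def dirichlet_prevalences(target, concentration, n_bags, rng):
    """Draw prevalence vectors with mean `target` (assumed alpha = concentration * target)."""
    alpha = concentration * np.asarray(target, dtype=float)
    return rng.dirichlet(alpha, size=n_bags)


rng = np.random.default_rng(0)
target = [0.7, 0.3]
tight = dirichlet_prevalences(target, 150, 2000, rng)   # large kappa: low variability
loose = dirichlet_prevalences(target, 2, 2000, rng)     # small kappa: wide spread

print("means:", tight.mean(axis=0), loose.mean(axis=0))
print("std of class 0:", tight[:, 0].std(), loose[:, 0].std())

# Balanced target with concentration == n_classes: every alpha equals 1.
print(2 * np.array([0.5, 0.5]))
```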
- n_prevalences, repeats : int, default=None, 1
Grid density and repetitions for prevalence='grid'.
- covariate_scale : float, default=None
For shift_type='covariate': magnitude of the per-bag feature translation (in feature-std units). 0 leaves the cloud in place; larger values move \(P(x)\) further. Defaults to 1.5.
- concept_strength : float, default=None
For shift_type='concept': how far the reference decision boundary is rotated per bag (radians-scale). 0 keeps the base boundary; larger values move it more. Defaults to 0.5.
- return_train : bool, default=False
If True, also return a dedicated training sample drawn from a disjoint half of the population.
- train_size : int, default=None
Size of the returned training sample (defaults to the whole training half).
- train_prevalence : array-like of shape (n_classes,), default=None
Prevalence of the training sample (defaults to the natural population prior).
- return_prevalences : bool, default=True
If True, also return the (n_bags, n_classes) array of each bag’s true prevalence.
- return_boundary : bool, default=False
If True, also return a DecisionBoundary namedtuple capturing the linear boundary used for each bag (coef and intercept stacked over bags). Covariate and prior bags share one fixed boundary; concept bags each carry their own rotated boundary, so the object records exactly how the rule moves, with no need to re-fit a classifier. For 2-D data, the boundary of bag i can be drawn from coef[i] and intercept[i].
- stack : bool, default=False
If True, stack the bags into (n_bags, batch_size, n_features) and (n_bags, batch_size) arrays. Requires equal bag sizes.
- pack : {'lists', 'flat'}, default='lists'
'lists' returns Xs and ys as lists of per-bag arrays. 'flat' spreads them as (X1, ..., Xn, y1, ..., yn) (incompatible with return_train).
- as_frame : bool, default=False
Return each bag’s X as a pandas DataFrame and y as a Series.
- shuffle : bool, default=True
Shuffle instances within each bag.
- random_state : int, Generator or None, default=None
Controls the population and the sampling.
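The layouts produced by the stack and pack options can be mimicked with plain NumPy on stand-in bags. This sketch only illustrates the shapes described above; it does not call the library, and the bags are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three stand-in bags of equal size: 4 instances with 2 features each.
Xs = [rng.normal(size=(4, 2)) for _ in range(3)]
ys = [rng.integers(0, 2, size=4) for _ in range(3)]

# stack=True analogue: every bag must have the same shape.
X_stacked = np.stack(Xs)   # (n_bags, batch_size, n_features)
y_stacked = np.stack(ys)   # (n_bags, batch_size)
print(X_stacked.shape, y_stacked.shape)

# pack='flat' analogue: all X bags first, then all y bags, in one tuple.
flat = (*Xs, *ys)
print(len(flat))

# Bags of unequal size cannot be stacked into one array.
try:
    np.stack([np.zeros((4, 2)), np.zeros((5, 2))])
except ValueError as exc:
    print("cannot stack:", exc)
```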
- Returns:
The return is a tuple assembled from the following, in order:
  - the optional training sample X_train, y_train (only if return_train),
  - the bags Xs, ys (lists, or stacked arrays, or spread out when pack='flat'),
  - the prevalences array (only if return_prevalences), and
  - the fitted decision boundary (only if return_boundary).
With the defaults this is (Xs, ys, prevalences).
See also
mlquantify.model_selection.apply_protocol : Run a protocol over real data.
Examples
>>> from mlquantify.datasets import make_quantification
>>> Xs, ys, prevs = make_quantification(n_batches=3, random_state=0)
>>> len(Xs), len(ys), prevs.shape
(3, 3, (3, 2))
>>> # Bags concentrated near a 70/30 split:
>>> Xs, ys, prevs = make_quantification(
...     n_batches=20, prevalence="dirichlet",
...     target_prevalence=[0.7, 0.3], concentration=150, random_state=0)
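Since the returned prevalences are described as the achieved class proportions of each bag, they can be cross-checked against the labels. The sketch below uses small hand-written label arrays in place of the returned ys, and `bag_prevalence` is a hypothetical helper, not part of the library.

```python
import numpy as np


def bag_prevalence(y, n_classes):
    """Class proportions of one bag, counting classes that do not occur as 0."""
    return np.bincount(y, minlength=n_classes) / len(y)


# Stand-in bags; with the library these would be the returned `ys`.
ys = [np.array([0, 0, 1, 1]), np.array([1, 1, 1, 0]), np.array([0, 0, 0, 0])]
prevs = np.vstack([bag_prevalence(y, n_classes=2) for y in ys])
print(prevs)
```

Passing minlength keeps the array at shape (n_bags, n_classes) even when a bag contains no instance of some class.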