.. _synthetic_datasets:

.. currentmodule:: mlquantify.datasets

==================
Synthetic Datasets
==================

:func:`make_quantification` is the quantification analogue of
:func:`sklearn.datasets.make_classification`. Where a classification generator
hands you one labelled table, a *quantification* generator must hand you many
**bags**, test samples whose class proportions vary, together with each bag's
**true prevalence**, because that vector (not the individual labels) is what a
quantifier is scored against.

How it works
------------------------------------------------------------

The generator works in two stages:

1. **Build one labelled population.** Internally it calls
   :func:`~sklearn.datasets.make_classification` once to create a pool of
   ``n_samples`` points, whose difficulty you control with ``class_sep``
   (cluster separation), ``flip_y`` (label noise), ``n_features`` and
   ``weights``.
2. **Draw** ``n_batches`` **bags from that population**, where each bag's class
   balance, feature positions, or labelling are perturbed according to the
   chosen ``shift_type``.

The call returns three aligned objects:

.. code-block:: python

   from mlquantify.datasets import make_quantification

   Xs, ys, prevalences = make_quantification(n_batches=10, random_state=0)
   # Xs          : list of feature matrices, one per bag
   # ys          : list of label vectors, one per bag
   # prevalences : (n_batches, n_classes) array of TRUE prevalences

``prevalences[i]`` is the quantification target for bag ``i``: feed ``Xs[i]`` to
a fitted quantifier and compare its prediction against ``prevalences[i]``.

Controlling the prevalence (prior shift)
------------------------------------------------------------

By default ``shift_type="prior"``: the class prevalence :math:`P(y)` changes
from bag to bag while the class-conditional features :math:`P(x \mid y)` stay
fixed. The ``prevalence`` argument decides *how* each bag's target prevalence is
drawn:

.. list-table::
   :header-rows: 1
   :widths: 22 78

   * - ``prevalence``
     - Behaviour
   * - ``"uniform"``
     - Prevalences spread uniformly over the probability simplex, covering the
       full range of shifts (the default).
   * - ``"grid"``
     - A regular grid over the simplex (the Artificial Prevalence Protocol);
       the bag count is then set by ``n_prevalences`` and ``repeats``.
   * - ``"natural"``
     - Bags drawn i.i.d. from the population, so prevalence fluctuates only
       with sampling noise around ``weights``.
   * - ``"dirichlet"``
     - From a Dirichlet centred on ``target_prevalence`` with spread set by
       ``concentration``: high concentration pins bags to the target, low
       concentration spreads them out.

.. code-block:: python

   # bags concentrated near a 70/30 split
   Xs, ys, prevs = make_quantification(
       n_batches=30,
       prevalence="dirichlet",
       target_prevalence=[0.7, 0.3],
       concentration=150,
       random_state=0)

Shift types
------------------------------------------------------------

Beyond prior shift, ``shift_type`` can request the other two canonical kinds of
dataset shift. Each is realised by a different per-bag operation on the *same*
fixed population:

.. list-table::
   :header-rows: 1
   :widths: 16 38 46

   * - ``shift_type``
     - What changes
     - How a bag is built
   * - ``"prior"``
     - :math:`P(y)`, the class balance
     - Resample the population to a target prevalence; the labels are the
       population's own.
   * - ``"covariate"``
     - :math:`P(x)`, where the features sit
     - **Translate** the bag's features by a random vector, then label them
       with a *fixed* reference boundary. The boundary stays put while the
       cloud slides across it (``covariate_scale`` sets the distance).
   * - ``"concept"``
     - :math:`P(y \mid x)`, the labelling rule
     - Keep the features in place but **rotate** the reference boundary and
       relabel, so the same point can change class (``concept_strength`` sets
       the rotation).

Both covariate and concept shift use a single **reference decision boundary**,
a logistic-regression rule fit on the population.
Covariate shift moves the features across that boundary; concept shift rotates
the boundary itself.

The decision boundary
------------------------------------------------------------

Because that boundary is explicit, ``return_boundary=True`` hands it back, so
the exact labelling rule is known rather than hidden:

.. code-block:: python

   Xs, ys, prevs, boundary = make_quantification(
       n_batches=5,
       shift_type="covariate",
       return_boundary=True,
       random_state=0)

``boundary`` is a ``DecisionBoundary`` namedtuple with per-bag ``coef`` and
``intercept`` arrays. For a **binary** problem ``coef`` has shape
``(n_bags, n_features)`` and ``intercept`` ``(n_bags,)``, giving the hyperplane
``coef[i] · x + intercept[i] = 0``. Covariate (and prior) bags share one fixed
boundary, so all rows are identical; **concept** bags each carry their own
rotated boundary.

For ``k > 2`` classes the shapes become ``(n_bags, k, n_features)`` and
``(n_bags, k)``, one weight vector per class, and the label is
``argmax(coef[i] · x + intercept[i])``, which carves the feature space into
``k`` piecewise-linear regions rather than splitting it with a single line.

Stacking shifts
------------------------------------------------------------

Real data rarely shifts in one way only. Pass a **list** to ``shift_type`` to
combine them:

.. code-block:: python

   Xs, ys, prevs = make_quantification(
       n_batches=50,
       shift_type=["prior", "covariate", "concept"],
       prevalence="uniform",
       covariate_scale=0.4,
       concept_strength=0.3,
       random_state=0)

Stacked shifts compose per bag in a fixed order: covariate translates the
features, concept rotates the boundary, the bag is relabelled by that boundary,
and prior finally resamples it to the target prevalence, so every bag can
differ in position, labelling rule **and** class balance at once.
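
The composition order described above can be sketched in plain NumPy for a
binary, two-dimensional toy problem. This is an illustration only, not
mlquantify's implementation: the helper name ``make_stacked_bag``, the Gaussian
translation, the 2-D rotation matrix, the oversampled candidate pool and
resampling with replacement are all assumptions made for the example.

.. code-block:: python

   import numpy as np


   def make_stacked_bag(X_pop, coef, intercept, target_prev, batch_size,
                        covariate_scale, concept_angle, rng):
       """Hypothetical helper: one bag with covariate + concept + prior shift."""
       # 1. Covariate: take a candidate pool and translate it by one random vector.
       idx = rng.choice(len(X_pop), size=4 * batch_size, replace=True)
       shift = rng.normal(scale=covariate_scale, size=X_pop.shape[1])
       X = X_pop[idx] + shift

       # 2. Concept: rotate the reference boundary's normal vector (2-D only).
       c, s = np.cos(concept_angle), np.sin(concept_angle)
       coef_bag = np.array([[c, -s], [s, c]]) @ coef

       # 3. Relabel every candidate point with the rotated boundary.
       y = (X @ coef_bag + intercept > 0).astype(int)

       # 4. Prior: resample the relabelled pool to the target prevalence.
       n_pos = int(round(target_prev * batch_size))
       n_neg = batch_size - n_pos
       pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
       take = np.concatenate([
           rng.choice(pos, size=n_pos, replace=True),
           rng.choice(neg, size=n_neg, replace=True),
       ])
       rng.shuffle(take)
       return X[take], y[take], coef_bag


   rng = np.random.default_rng(0)
   X_pop = rng.normal(size=(1000, 2))          # stand-in population
   X_bag, y_bag, coef_bag = make_stacked_bag(
       X_pop, coef=np.array([1.0, 0.0]), intercept=0.0,
       target_prev=0.7, batch_size=100,
       covariate_scale=0.4, concept_angle=0.3, rng=rng)
   print(X_bag.shape, y_bag.mean())

Because the prior step runs last, the bag's class balance matches the requested
prevalence regardless of how far the translation and rotation moved the labels.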

Training data and difficulty
------------------------------------------------------------

Set ``return_train=True`` to also receive a clean training sample (drawn from a
disjoint half of the population at ``train_prevalence``), so you can fit a
quantifier and score it on the shifted bags in a single call:

.. code-block:: python

   X_train, y_train, Xs, ys, prevs = make_quantification(
       n_batches=100,
       return_train=True,
       train_prevalence=[0.5, 0.5],
       random_state=0)

How hard the bags are to quantify is governed by the underlying problem:
``class_sep`` (lower is harder), ``flip_y`` (more label noise), ``n_features``
(more noise dimensions) and ``batch_size`` (smaller bags are noisier).

.. seealso::

   - :ref:`sphx_synthetic_intro` and the surrounding gallery section visualise
     every option above with plots.
   - :func:`make_quantification`: the full parameter reference.
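
The fit-and-score workflow from *Training data and difficulty* can be sketched
without any quantification library. The example below is a minimal sketch under
stated assumptions: the arrays are NumPy stand-ins with the same layout as the
``X_train, y_train, Xs, prevs`` values described above, and the helpers
``classify_and_count`` and ``mean_absolute_error`` are hypothetical names
written for this illustration, not mlquantify functions.

.. code-block:: python

   import numpy as np


   def classify_and_count(X_train, y_train, X_bag):
       """Estimate a bag's prevalence by counting nearest-centroid predictions."""
       classes = np.unique(y_train)
       centroids = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
       # Squared distance from every bag point to every class centroid.
       d2 = ((X_bag[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
       pred = d2.argmin(axis=1)
       return np.bincount(pred, minlength=len(classes)) / len(X_bag)


   def mean_absolute_error(prev_true, prev_pred):
       """Average absolute gap between true and estimated prevalence vectors."""
       diff = np.asarray(prev_true) - np.asarray(prev_pred)
       return float(np.mean(np.abs(diff)))


   # Stand-in data: two Gaussian classes, training set at 50/50, bags shifted.
   rng = np.random.default_rng(0)
   means = np.array([[-2.0, 0.0], [2.0, 0.0]])


   def sample(n, p1):
       y = (rng.random(n) < p1).astype(int)
       return means[y] + rng.normal(size=(n, 2)), y


   X_train, y_train = sample(500, 0.5)
   bags = [sample(200, p) for p in (0.1, 0.5, 0.9)]
   Xs = [X for X, _ in bags]
   prevalences = np.array([np.bincount(y, minlength=2) / len(y) for _, y in bags])

   estimates = np.array([classify_and_count(X_train, y_train, X) for X in Xs])
   print(mean_absolute_error(prevalences, estimates))

Each bag is scored against its row of true prevalences rather than against
per-sample labels, which is the comparison a quantifier is evaluated on.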