.. _quantification_protocols:

.. currentmodule:: mlquantify.model_selection

==============================
Protocols for Quantification
==============================

Evaluating a quantifier on a single test set is misleading: the test prevalence is fixed, so you only see performance at one operating point. Quantification protocols address this by generating **many test batches** with varying prevalences from the same data, giving a fuller picture of method behaviour across the entire prevalence spectrum.

.. admonition:: Why protocols matter

   A quantifier that looks excellent at 50/50 prevalence may fail badly at 5/95. Forman (2005) noted that the choice of evaluation protocol is as important as the choice of method. Standard practice in quantification research is to evaluate across a grid of prevalences (APP) and report the mean error over all samples.

.. contents:: Contents
   :local:
   :depth: 2

----

Quick evaluation with ``apply_protocol``
========================================

:func:`apply_protocol` runs the whole evaluation loop in a single call, making it the protocol analogue of scikit-learn's :func:`~sklearn.model_selection.cross_validate`. It fits the quantifier, samples the test batches with the chosen protocol, predicts each one, and returns the true and predicted prevalences together with one score array per metric:

.. code-block:: python

    from mlquantify.model_selection import apply_protocol
    from mlquantify.likelihood import EMQ
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=42)

    results = apply_protocol(
        EMQ(LogisticRegression()),
        X, y,
        protocol="app",            # 'app' | 'npp' | 'upp' | 'ppp'
        scoring=["mae", "nmd"],    # one metric name, a callable, or a list
        n_prevalences=21,
        batch_size=100,
        test_size=0.5,             # held-out pool the protocol samples from
        random_state=42,
    )

    print("samples:", results["n_batches"])
    print("MAE:", results["MAE"].mean(), "NMD:", results["NMD"].mean())
    # results["true_prevalences"], results["predicted_prevalences"] -> (n_samples, n_classes)

By default a copy of the quantifier is trained on ``1 - test_size`` of the data and evaluated on the rest. Pass ``fit=False`` to evaluate an already-fitted quantifier, ``return_estimator=True`` to get the trained model back, or a :class:`BaseProtocol` instance as ``protocol`` for full control. The sections below document the underlying protocols, which you can also drive manually.

----

APP: Artificial Prevalence Protocol
======================================

:class:`APP` is the most widely used evaluation protocol. By default it draws samples from the test set at each prevalence in a uniform grid :math:`\{0, \frac{1}{n-1}, \frac{2}{n-1}, \ldots, 1\}` for the positive class, where :math:`n` is ``n_prevalences``, repeating each prevalence ``repeats`` times.

**Why it is standard:** APP ensures every method is evaluated at many prevalence values, not just the natural one. It exposes systematic biases (e.g. methods that only work near 50/50) and gives a fair cross-method comparison. Research papers, as surveyed by González et al. (2017), routinely use APP as the evaluation backbone.

**Choosing how prevalences are produced.** The ``strategy`` parameter selects how the prevalence vectors are drawn over the simplex.
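
As a rough illustration of what drawing prevalence vectors over the simplex involves, the sketch below uses only NumPy and the standard library, not mlquantify. The helper ``grid_size`` is hypothetical, and it assumes a full lattice in which every class prevalence is a multiple of ``1 / (n_prevalences - 1)``; the library's own grid construction may differ. The flat Dirichlet draw is one standard way to sample the simplex uniformly.

.. code-block:: python

    import math

    import numpy as np


    def grid_size(n_prevalences: int, n_classes: int) -> int:
        """Count lattice points on the simplex (hypothetical helper, full lattice assumed)."""
        steps = n_prevalences - 1
        # Number of ways to split `steps` increments among `n_classes` classes.
        return math.comb(steps + n_classes - 1, n_classes - 1)


    # The lattice grows quickly with the number of classes.
    for k in (2, 3, 5, 10):
        print(f"{k} classes: {grid_size(21, k)} lattice points")

    # Sampling keeps the cost fixed: 100 prevalence vectors for 5 classes.
    rng = np.random.default_rng(0)
    sampled = rng.dirichlet(np.ones(5), size=100)
    print(sampled.shape)  # each row is non-negative and sums to 1

With sampling, the number of prevalence vectors is chosen directly and does not depend on the number of classes.
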
``'grid'`` is the classic systematic sweep; the other strategies *sample* the simplex and scale to many classes without the grid's combinatorial blow-up. :class:`UPP` is simply ``APP`` with a sampling strategy pinned on.

Parameters
----------

.. list-table::
   :widths: 22 15 63
   :header-rows: 1

   * - Parameter
     - Default
     - Explanation
   * - ``batch_size``
     - required
     - Number of instances per test sample. Larger batches give more stable prevalence estimates but require a larger test set. A typical choice is 100–500.
   * - ``n_prevalences``
     - ``21``
     - Number of equally-spaced prevalence points from ``min_prev`` to ``max_prev``. With the default bounds, ``21`` gives a step of 0.05 (i.e. 0%, 5%, 10%, …, 100%).
   * - ``repeats``
     - ``10``
     - How many independent samples to draw at each prevalence level. More repeats reduce variance in the average error estimate. Use ≥ 5 for reliable results.
   * - ``min_prev``
     - ``0.0``
     - Minimum positive class prevalence in the grid. Leave at 0 to include the all-negative case.
   * - ``max_prev``
     - ``1.0``
     - Maximum positive class prevalence. Leave at 1 to include the all-positive case.
   * - ``strategy``
     - ``'grid'``
     - How prevalence vectors are generated over the simplex:

       - ``'grid'``: a regular lattice of evenly-spaced prevalences from ``min_prev`` to ``max_prev`` (the classic APP). Deterministic and systematic, but the number of points grows combinatorially (:math:`O(n^{k-1})` for ``k`` classes), so it is best for binary or low-class-count problems.
       - ``'kraemer'``: the Kraemer method for *uniform* sampling over the simplex. Every prevalence combination is equally likely and the cost is independent of the number of classes, which makes it ideal for multiclass problems.
       - ``'uniform'``: uniform sampling via the flat Dirichlet :math:`\mathrm{Dir}(\mathbf{1})`. Statistically equivalent to ``'kraemer'`` but produced through the Dirichlet route; it is exactly ``'dirichlet'`` with ``dirichlet_alpha=1``.
       - ``'dirichlet'``: sampling from a Dirichlet whose concentration is set by ``dirichlet_alpha``, letting you *bias* the prevalences (see below).
   * - ``dirichlet_alpha``
     - ``1.0``
     - Concentration for ``strategy='dirichlet'``. A scalar is broadcast to a symmetric Dirichlet; an array of length ``n_classes`` sets a per-class concentration. ``alpha > 1`` favours balanced prevalences near the centre of the simplex; ``alpha < 1`` favours extreme, one-class-dominant prevalences near the corners; ``alpha = 1`` is uniform. Ignored by the other strategies.
   * - ``random_state``
     - ``None``
     - Seed for reproducible sampling.

.. figure:: ../images/app_protocol.png
   :align: center
   :width: 90%
   :alt: APP vs NPP protocol comparison

   *Left: APP generates test samples at every point on a regular prevalence grid (blue dots), giving systematic coverage from 0% to 100% positive class. Right: NPP draws random sub-samples that cluster near the natural training prevalence (~50%), providing realistic but narrower coverage.*

Examples
--------

Standard evaluation loop:

.. code-block:: python

    from mlquantify.model_selection import APP
    from mlquantify.metrics import MAE
    from mlquantify.utils import get_prev_from_labels
    from mlquantify.likelihood import EMQ
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    import numpy as np

    X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=42)

    q = EMQ(LogisticRegression())
    q.fit(X_train, y_train)

    protocol = APP(batch_size=100, n_prevalences=21, repeats=10, random_state=42)

    errors = []
    for idx in protocol.split(X_test, y_test):
        X_sample, y_sample = X_test[idx], y_test[idx]
        true_prev = get_prev_from_labels(y_sample)
        pred_prev = q.predict(X_sample)
        errors.append(MAE(true_prev, pred_prev))

    print(f"Mean MAE over {len(errors)} samples: {np.mean(errors):.4f}")

Comparing multiple quantifiers:

.. code-block:: python

    # Reuses APP, MAE, get_prev_from_labels, np and the train/test split
    # from the previous example.
    from mlquantify.counting import CC, PCC
    from mlquantify.likelihood import EMQ
    from mlquantify.matching import DyS
    from sklearn.linear_model import LogisticRegression

    quantifiers = {
        'CC': CC(LogisticRegression()),
        'PCC': PCC(LogisticRegression()),
        'EMQ': EMQ(LogisticRegression()),
        'DyS': DyS(LogisticRegression()),
    }

    for name, q in quantifiers.items():
        q.fit(X_train, y_train)

    # One protocol instance, so every quantifier sees the same batches.
    protocol = APP(batch_size=100, n_prevalences=21, repeats=10, random_state=42)

    results = {name: [] for name in quantifiers}
    for idx in protocol.split(X_test, y_test):
        X_s, y_s = X_test[idx], y_test[idx]
        true_prev = get_prev_from_labels(y_s)
        for name, q in quantifiers.items():
            results[name].append(MAE(true_prev, q.predict(X_s)))

    for name, errs in results.items():
        print(f"{name:5s} MAE={np.mean(errs):.4f}")

----

NPP: Natural Prevalence Protocol
===================================

:class:`NPP` draws random sub-samples from the test set without altering the natural class distribution.
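
A minimal NumPy-only sketch of natural sub-sampling, offered as an illustration rather than the :class:`NPP` implementation: the label pool ``y_pool`` is synthetic, and drawing each batch without replacement is an assumption made here.

.. code-block:: python

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical binary label pool with roughly 30% positives.
    y_pool = (rng.random(1000) < 0.3).astype(int)

    batch_size, n_samples = 100, 50
    prevs = []
    for _ in range(n_samples):
        # Plain random subset: no class is over- or under-sampled on purpose.
        idx = rng.choice(len(y_pool), size=batch_size, replace=False)
        prevs.append(y_pool[idx].mean())
    prevs = np.array(prevs)

    print(f"pool prevalence:  {y_pool.mean():.3f}")
    print(f"batch prevalence: min={prevs.min():.3f}, max={prevs.max():.3f}")

The batch prevalences scatter in a band around the pool prevalence, which is the only source of variation under this kind of sampling.
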
Each sample has a slightly different prevalence due to random variation, but no artificial manipulation is performed.

**Why it exists:** NPP evaluates quantifiers under *real* prevalence variation, that is, how they perform when deployed on random sub-populations drawn from the same underlying distribution as the test set. It is less controlled than APP but more realistic.

**Limitation:** Because NPP does not manipulate the class distribution, extreme prevalences (e.g. 2% positive) essentially never appear unless the test set itself is that skewed, so it gives a narrower view of method behaviour than APP.

Parameters
----------

.. list-table::
   :widths: 22 15 63
   :header-rows: 1

   * - Parameter
     - Default
     - Explanation
   * - ``batch_size``
     - required
     - Size of each random sub-sample.
   * - ``n_samples``
     - ``100``
     - Number of random sub-samples to draw.
   * - ``random_state``
     - ``None``
     - Seed for reproducibility.

.. code-block:: python

    # Reuses q, X_test and y_test from the APP examples above.
    from mlquantify.model_selection import NPP
    from mlquantify.utils import get_prev_from_labels

    protocol = NPP(batch_size=100, n_samples=50, random_state=42)

    for idx in protocol.split(X_test, y_test):
        X_s, y_s = X_test[idx], y_test[idx]
        true_prev = get_prev_from_labels(y_s)
        pred_prev = q.predict(X_s)

----

UPP: Uniform Prevalence Protocol
===================================

:class:`UPP` samples prevalence vectors uniformly from the **probability simplex**. It is exactly :class:`APP` with the simplex sampling ``strategy`` pinned on (``'kraemer'`` by default). For binary problems it is similar to APP, but for **multiclass** problems it avoids the combinatorial explosion of sweeping all class-prevalence combinations independently.

**Why it exists:** For :math:`k` classes, a grid approach such as APP with ``strategy='grid'`` grows as :math:`O(n^{k-1})`, which quickly becomes intractable. UPP instead samples :math:`n` random vectors from the simplex (``n_prevalences`` of them), covering the multiclass prevalence space efficiently without a rigid grid. Maletzke et al. (2020) recommend UPP for multiclass evaluation.

Parameters
----------

.. list-table::
   :widths: 22 15 63
   :header-rows: 1

   * - Parameter
     - Default
     - Explanation
   * - ``batch_size``
     - required
     - Size of each sample.
   * - ``n_prevalences``
     - ``100``
     - Number of prevalence vectors to sample from the simplex.
   * - ``strategy``
     - ``'kraemer'``
     - Simplex sampling strategy, forwarded to :class:`APP` (``'grid'`` is not meaningful here):

       - ``'kraemer'``: Kraemer uniform sampling over the simplex. All prevalence combinations are equally likely; cost independent of the number of classes.
       - ``'uniform'``: uniform sampling via the flat Dirichlet (:math:`\mathrm{Dir}(\mathbf{1})`); equivalent uniform coverage through the Dirichlet route.
       - ``'dirichlet'``: Dirichlet sampling biased by ``dirichlet_alpha`` (see :class:`APP`).
   * - ``dirichlet_alpha``
     - ``1.0``
     - Concentration used when ``strategy='dirichlet'``; see :class:`APP`.
   * - ``algorithm``
     - *(deprecated)*
     - Deprecated alias for ``strategy``; kept for backward compatibility.
   * - ``min_prev``
     - ``0.0``
     - Minimum per-class prevalence. Raise (e.g. to ``0.01``) to avoid near-zero classes that are hard to sample from small datasets.
   * - ``max_prev``
     - ``1.0``
     - Maximum per-class prevalence.
   * - ``random_state``
     - ``None``
     - Seed for reproducible sampling.

.. code-block:: python

    from mlquantify.model_selection import UPP
    from mlquantify.metrics import MAE
    from mlquantify.utils import get_prev_from_labels
    from mlquantify.likelihood import EMQ
    from sklearn.linear_model import LogisticRegression
    from sklearn.datasets import make_classification

    X, y = make_classification(n_samples=2000, n_classes=4, n_informative=6,
                               n_redundant=0, random_state=42)
    X_train, X_test = X[:1500], X[1500:]
    y_train, y_test = y[:1500], y[1500:]

    # Refit on the 4-class data: the binary quantifier from the earlier
    # examples does not apply here. EMQ is reused only for continuity.
    q = EMQ(LogisticRegression())
    q.fit(X_train, y_train)

    protocol = UPP(batch_size=100, n_prevalences=200, strategy='uniform', random_state=42)

    errors = []
    for idx in protocol.split(X_test, y_test):
        X_s, y_s = X_test[idx], y_test[idx]
        true_prev = get_prev_from_labels(y_s)
        pred_prev = q.predict(X_s)
        errors.append(MAE(true_prev, pred_prev))

----

PPP: Personalized Prevalence Protocol
========================================

:class:`PPP` generates samples at class prevalences you specify **explicitly**, for targeted evaluation at exact operating points (whereas APP and UPP sweep the prevalences for you). Pass a list of prevalence vectors; in the binary case a single float is read as the positive-class prevalence.

Parameters
----------

.. list-table::
   :widths: 22 15 63
   :header-rows: 1

   * - Parameter
     - Default
     - Explanation
   * - ``batch_size``
     - required
     - Size of each sample.
   * - ``prevalences``
     - required
     - List of target prevalence vectors (or floats for binary problems).
   * - ``repeats``
     - ``1``
     - Number of samples drawn per target prevalence.
   * - ``random_state``
     - ``None``
     - Seed for reproducibility.

.. code-block:: python

    # Assumes a binary quantifier q and a binary X_test, y_test,
    # as in the APP examples above.
    from mlquantify.model_selection import PPP
    from mlquantify.utils import get_prev_from_labels

    protocol = PPP(batch_size=100,
                   prevalences=[[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]],
                   random_state=42)

    for idx in protocol.split(X_test, y_test):
        X_s, y_s = X_test[idx], y_test[idx]
        true_prev = get_prev_from_labels(y_s)
        pred_prev = q.predict(X_s)

----

Choosing a Protocol
=====================

.. list-table::
   :widths: 15 20 65
   :header-rows: 1

   * - Protocol
     - Problem type
     - Use when
   * - APP
     - Binary
     - **Default for binary problems.** Systematic sweep; standard in quantification research. Forman (2005) introduced the concept.
   * - NPP
     - Binary / multiclass
     - You want realistic evaluation under natural prevalence variation.
   * - UPP (uniform)
     - Multiclass
     - **Default for multiclass.** Efficient random coverage of the simplex.
   * - UPP (kraemer)
     - Multiclass
     - You want the same uniform coverage through the Kraemer method, which is the default ``strategy`` of :class:`UPP`. Neither UPP variant is a deterministic grid; for that, use :class:`APP` with ``strategy='grid'`` on low-class-count problems.
   * - PPP
     - Binary / multiclass
     - You want to evaluate at specific, hand-picked prevalences.

.. tip::

   For most workflows, reach for :func:`apply_protocol` rather than writing the loop by hand: it accepts the same ``protocol`` choice and returns the scores directly.

.. tip::

   Always fix ``random_state`` in protocols when comparing methods so that all quantifiers are evaluated on exactly the same test samples.

.. seealso::

   :ref:`quantification_foundations` for a conceptual overview of why protocols are necessary.

   :ref:`model_selection` for hyperparameter tuning with ``GridSearchQ``.

References
==========

.. dropdown:: References

   - Forman, G. (2008). Quantifying Counts and Costs via Classification. *Data Mining and Knowledge Discovery*, 17(2), 164–206.
   - González, P., Castaño, A., Chawla, N. V., & del Coz, J. J. (2017). A Review on Quantification Learning. *ACM Computing Surveys*, 50(5), 1–40.
   - Esuli, A., Fabris, A., Moreo, A., & Sebastiani, F. (2023). *Learning to Quantify*. The Information Retrieval Series, Springer.