.. _quantification_foundations: ========================== Quantification Foundations ========================== This page introduces the core theory behind quantification — what the problem is, why it differs from classification, how dataset shift motivates it, and how the main method families address it. Reading this page will help you understand **why** every parameter in every quantifier exists. .. contents:: Contents :local: :depth: 2 ---- What Is Quantification? ======================= **Quantification** (also called *class prevalence estimation* or *class prior estimation*) is the task of estimating the **proportion** — or *prevalence* — of each class in an unlabelled dataset, rather than labelling individual instances. Given: - A labelled training set :math:`L = \{(x_i, y_i)\}_{i=1}^{n}` with :math:`y_i \in \{c_1, \ldots, c_k\}`, - An unlabelled test set :math:`U = \{x_j\}_{j=1}^{m}`, the goal is to estimate the vector of class prevalences .. math:: \hat{p}(c) = \frac{|\{x \in U : y = c\}|}{|U|}, \quad \forall c \in \mathcal{Y}. The output is a probability vector :math:`\hat{p} \in \Delta^{k-1}` (the probability simplex), where :math:`\hat{p}(c) \ge 0` and :math:`\sum_c \hat{p}(c) = 1`. .. admonition:: Quickstart analogy Imagine a hospital receives a batch of 1,000 blood samples. A pathologist does **not** need to diagnose every single patient — they need to know *how many* samples belong to each disease category to allocate resources. Quantification answers this aggregate question directly. When Is Quantification the Right Tool? --------------------------------------- Quantification is the right tool when: - The final decision is about a **population**, not an individual (e.g. "how many tweets are about a product complaint?"). - Labelling every instance is too expensive, but estimating proportions is sufficient. - The class distribution is expected to **shift** between training and deployment. - Evaluation uses aggregate metrics (e.g. proportion of defective items in a production batch). If you need a label for *each* instance, use a classifier. If you need the proportions of a batch, use a quantifier. ---- Why Not Just Classify and Count? ================================== The simplest approach — train a classifier, predict labels, count how many fall into each class — is called **Classify and Count (CC)**. It is a valid baseline, but it is systematically biased whenever the class distribution in the test set differs from training. The CC Bias ----------- Suppose a binary classifier is trained on a balanced dataset (50% positive, 50% negative) and achieves 90% accuracy. Now consider deploying it on a batch where only 5% of instances are truly positive. .. figure:: ../images/cc_bias.png :align: center :width: 75% :alt: CC and PCC bias across prevalences *CC (red) systematically overshoots at low prevalence and undershoots at high prevalence. PCC (orange) is less biased but still not corrected. The dashed grey line is the ideal unbiased estimator.* The classifier will generate: - True positives: :math:`0.90 \times 0.05 = 0.045` (from 5% positives it gets right) - False positives: :math:`0.10 \times 0.95 = 0.095` (from 95% negatives it misclassifies) CC's estimated positive prevalence is :math:`0.045 + 0.095 = 0.14`, while the true value is :math:`0.05` — an error of **9 percentage points**, even with a very accurate classifier! Forman (2005) showed empirically that CC consistently overestimates the minority class when the test prevalence is low and underestimates it when high, regardless of classifier accuracy. The bias is *systematic* and does not vanish with more data. (Forman, 2005) .. admonition:: Key insight A classifier optimises *instance-level* accuracy, not *aggregate-level* accuracy. The two objectives are different. Specialised quantification methods optimise directly for the latter. ---- Dataset Shift ============= Quantification is intimately linked to **dataset shift** — the phenomenon where the distribution of data changes between training and deployment. (Moreno-Torres et al., 2012) .. figure:: ../images/dataset_shift.png :align: center :width: 90% :alt: Prior probability shift illustration *Prior probability shift: the feature distributions within each class (the histogram shapes) are identical between training and test, but the class proportions differ — 30% positive in training, 80% positive at test time.* The three most relevant shift types for quantification are: Prior Probability Shift (Label Shift) -------------------------------------- The class-conditional distribution stays the same, but class priors change: .. math:: P_U(X \mid Y) = P_L(X \mid Y), \quad P_U(Y) \neq P_L(Y). This is the *primary assumption* for most quantification methods. A sentiment classifier trained in January may face a burst of positive reviews after a product launch — the features of "positive review" have not changed, but the proportion has. **Methods that assume this shift:** CC, ACC, PCC, PACC, EMQ/SLD, DyS, HDy, KDEy. Covariate Shift --------------- The input distribution changes, but the conditional label distribution is stable: .. math:: P_U(Y \mid X) = P_L(Y \mid X), \quad P_U(X) \neq P_L(X). This occurs when the input features come from a different domain (e.g. a model trained on news articles applied to social media posts). Concept Shift (Concept Drift) ------------------------------ The relationship between features and labels changes: .. math:: P_U(Y \mid X) \neq P_L(Y \mid X). This is the hardest case. Most standard quantification methods cannot fully correct for concept shift. Monitoring for drift and retraining are necessary. .. tip:: When you are unsure which shift applies, start with methods that assume **prior probability shift** (the most common case). If performance is poor, check whether the feature distribution has also shifted. ---- The Aggregative Quantification Framework ========================================= Most quantification methods in ``mlquantify`` follow the **aggregative framework** introduced by (Esuli et al., 2023): 1. **Fit** — Train an underlying classifier (estimator) on labelled data, obtaining a *representation* of the training class distributions (confusion matrix, score histograms, density estimates, …). 2. **Predict** — Apply the estimator to each test instance to obtain predictions (hard labels or soft probabilities). 3. **Aggregate** — Combine the individual predictions into a single prevalence estimate for the batch. This design mirrors ``scikit-learn``'s *estimator* API: every aggregative quantifier exposes ``fit(X, y)``, ``predict(X)``, and an additional ``aggregate`` method for when predictions are already available. .. code-block:: python from mlquantify.counting import PCC from sklearn.linear_model import LogisticRegression # 1. Build and fit q = PCC(LogisticRegression()) q.fit(X_train, y_train) # 2 & 3 combined — predict on new data prevalences = q.predict(X_test) # Or use aggregate if you already have posteriors proba = q.estimator_.predict_proba(X_test) prevalences = q.aggregate(proba) .. _cross-validation-role: The Role of Cross-Validation in ``fit`` ---------------------------------------- Many aggregative methods need to estimate *how the classifier behaves on unseen data* (e.g. to estimate TPR/FPR for ACC, or score histograms for DyS/HDy). Using training-set predictions directly would give over-optimistic estimates. To avoid this, ``mlquantify`` uses **cross-validated predictions** (also called *hold-out predictions* or *calibrated predictions*) obtained by fitting the estimator on a subset of the training data and predicting the held-out subset. This is controlled by the ``cv`` parameter in ``fit``: .. list-table:: :widths: 20 80 :header-rows: 1 * - ``cv`` - Behaviour * - ``int`` (e.g. 5) - K-fold cross-validation; predictions are assembled from K non-overlapping folds. Recommended. * - ``None`` - The method uses its own default (usually 5-fold). Safe choice. * - Cross-validator object - E.g. ``StratifiedKFold(n_splits=10)``. Full control. .. code-block:: python from mlquantify.counting import ACC from sklearn.svm import SVC # Use 10-fold stratified CV to estimate TPR/FPR q = ACC(SVC(probability=True), cv=10, stratified=True) q.fit(X_train, y_train) .. warning:: Setting ``cv`` to a very small value (e.g. 2) on a small dataset can produce noisy TPR/FPR or histogram estimates and hurt performance. The default of 5 is a safe balance between bias and variance. The ``estimator_fitted`` parameter ------------------------------------ If you have already trained your classifier (e.g. in a pipeline), pass ``estimator_fitted=True`` to ``fit`` so ``mlquantify`` skips retraining and uses the existing model: .. code-block:: python from sklearn.ensemble import GradientBoostingClassifier clf = GradientBoostingClassifier().fit(X_train, y_train) # Skip refitting — use the already-trained clf q = PCC(clf) q.fit(X_train, y_train, estimator_fitted=True) ---- Method Families at a Glance ============================ .. figure:: ../images/method_comparison.png :align: center :width: 85% :alt: Illustrative error profiles of main quantification method families *Illustrative error profile across positive-class prevalence levels (synthetic data — for concept illustration only). CC and PCC are most biased at extreme prevalences. Adjusted counting (ACC/MS) partially corrects this. EMQ, DyS and KDEy achieve consistently low error across the full prevalence range.* ``mlquantify`` organises methods into families based on the **representation** they build during training and the **correction strategy** they apply. .. list-table:: :widths: 20 25 25 30 :header-rows: 1 * - Family - Representation - Correction - When to use * - :ref:`Counting ` (CC, PCC) - Hard / soft labels - None - Baseline; when training/test distributions are similar. * - :ref:`Adjusted Counting ` (ACC, TAC, …) - Confusion matrix / ROC curve - Linear bias correction - Binary problems with known classifier error rates. * - :ref:`Likelihood ` (EMQ, CDE) - Posterior probabilities - EM-based prior correction - Best single method under prior probability shift. * - :ref:`Distribution Matching ` (DyS, HDy, KDEy) - Score histograms / densities - Mixture optimisation - Strong performance; good for binary and multiclass. * - :ref:`Nearest Neighbours ` (PWK) - Feature space - Imbalance-aware k-NN - When a simple, interpretable baseline is needed. * - :ref:`Neural ` (QuaNet) - Deep embeddings - Direct prevalence learning - Large datasets with rich feature representations. * - :ref:`Ensemble ` (EnsembleQ, QuaDapt) - Any base quantifier - Diversity + selection - Robustness; when test prevalence is highly variable. .. tip:: If you are just getting started, try :class:`~mlquantify.likelihood.EMQ` with ``LogisticRegression`` — it is consistently among the top performers and requires minimal tuning. (Saerens et al., 2002) (Esuli et al., 2023) ---- Choosing the Right Evaluation Protocol ======================================== A single train/test split is often misleading for quantification because the test prevalence is fixed. Dedicated protocols generate many test samples with **different prevalences** from the same dataset: - **APP (Artificial Prevalence Protocol)** — sweeps prevalences from 0 to 1 in uniform steps. Standard in quantification research. Use this by default. - **UPP (Uniform Prevalence Protocol)** — samples prevalences uniformly from the simplex. Preferred for multiclass problems. - **NPP (Natural Prevalence Protocol)** — draws random subsets preserving natural class proportions. Less controlled, more realistic. See :ref:`quantification_protocols` for full details and code examples. Choosing the Right Evaluation Metric -------------------------------------- Quantification metrics penalise deviation between the estimated and true prevalence vectors. The most important ones in ``mlquantify`` are: .. list-table:: :widths: 15 30 55 :header-rows: 1 * - Metric - Import - When to use * - MAE - ``mlquantify.metrics.MAE`` - Default. Mean absolute difference per class. Interpretable. * - RAE - ``mlquantify.metrics.RAE`` - When errors at low prevalences should be amplified (relative AE). * - KLD - ``mlquantify.metrics.KLD`` - When probabilistic calibration of prevalences matters. * - NKLD - ``mlquantify.metrics.NKLD`` - Normalised KLD; comparable across different class counts. * - MSE - ``mlquantify.metrics.MSE`` - Squared error; penalises large deviations more heavily. .. code-block:: python from mlquantify.metrics import MAE, RAE from mlquantify.utils import get_prev_from_labels true_prev = get_prev_from_labels(y_test) pred_prev = quantifier.predict(X_test) print(f"MAE: {MAE(true_prev, pred_prev):.4f}") print(f"RAE: {RAE(true_prev, pred_prev):.4f}") ---- Multiclass Quantification ========================== All methods in ``mlquantify`` that are marked as *binary-only* are automatically extended to multiclass problems via a **decomposition strategy**: - **One-vs-Rest (OvR)** — for each class :math:`c`, train a binary quantifier that estimates :math:`\hat{p}(c)` vs. :math:`1 - \hat{p}(c)`, then renormalise. This is the default (``strategy='ovr'``). - **One-vs-One (OvO)** — train a binary quantifier for every pair of classes :math:`(c_i, c_j)`, then combine estimates. Less common, set ``strategy='ovo'``. Natively multiclass methods (CC, PCC, GACC, GPACC, EMQ, KDEy) operate directly on :math:`k` classes without decomposition. .. code-block:: python from mlquantify.counting import ACC from sklearn.linear_model import LogisticRegression from sklearn.datasets import make_classification X, y = make_classification(n_samples=500, n_classes=4, n_informative=6, n_redundant=0, random_state=42) # OvR decomposition is applied automatically q = ACC(LogisticRegression(), strategy='ovr') q.fit(X[:400], y[:400]) q.predict(X[400:]) ---- Further Reading =============== For a comprehensive treatment of quantification theory, the canonical reference is: - **Esuli, A., Fabris, A., Moreo, A., & Sebastiani, F. (2023).** *Learning to Quantify.* Springer. Open access at https://doi.org/10.1007/978-3-031-20467-8 For the original CC/AC paper: - **Forman, G. (2005).** Counting Positives Accurately Despite Inaccurate Classification. *ECML 2005*, pp. 564–575. For a comprehensive survey of quantification methods: - **González, J., Díez, J., Chawla, N., & del Coz, J. J. (2017).** A Review on Quantification Learning. *ACM Computing Surveys*, 50(5), 1–40.