.. _distribution_matching:

.. currentmodule:: mlquantify.matching

======================
Distribution Matching
======================

Distribution matching (DM) methods estimate prevalences without inverting a confusion matrix or re-weighting posteriors. Instead, they find the mixture proportion of class-conditional distributions that best **reproduces** the observed test distribution. This makes them versatile: they can handle non-standard classifiers and non-linear shift, and some families (KDEy, MMD_RKHS) handle multiclass problems natively.

**Core idea:** During training, DM methods learn a *representation* :math:`r_c` of each class's distribution (a histogram, density, kernel mean, …). At test time, they find the prevalence vector :math:`\hat{p}` such that the mixture

.. math::

   \sum_c \hat{p}(c) \cdot r_c \;\approx\; r_U

is as close as possible to the test representation :math:`r_U`, under some dissimilarity measure. This is solved as a constrained optimisation over the probability simplex.

``mlquantify`` organises distribution matching into four representation families (**histogram**, **density (KDE)**, **kernel**, and **scores**), all exposed through :mod:`mlquantify.matching`.

.. contents:: Contents
   :local:
   :depth: 2

----

Histogram Methods
=================

Histogram methods discretise the classifier's score distribution into bins. They are the oldest DM family and remain competitive on binary problems.

.. admonition:: Binary-only (unless noted)

   Histogram methods are fundamentally binary. ``mlquantify`` applies OvR decomposition automatically for multiclass datasets.

DyS — Distribution y-Similarity
---------------------------------

:class:`DyS` (Maletzke et al., 2019) builds a histogram of classifier scores for each class from cross-validated predictions. At test time it searches for the mixture proportion :math:`\alpha \in [0,1]` that minimises a dissimilarity measure between the test score histogram and the mixture of class histograms:

.. math::

   \hat{p}^{DyS}(\oplus) = \arg\min_{\alpha \in [0,1]} D\!\left( \alpha \cdot H_+ + (1-\alpha) \cdot H_- ,\; H_U \right)

**Why it excels:** DyS is a *framework* that separates the representation (histogram), the dissimilarity measure, and the solver. It can be configured with any distance and bin size. Maletzke et al. (2019) reported that it outperforms threshold-adjustment methods and is competitive with EMQ on many benchmarks.

Parameters
----------

.. list-table::
   :widths: 22 15 63
   :header-rows: 1

   * - Parameter
     - Default
     - Explanation
   * - ``estimator``
     - ``None``
     - Probabilistic classifier (``predict_proba``). Its cross-validated scores are used to build the class histograms.
   * - ``bins_size``
     - ``None``
     - Number of histogram bins, or an array of bin counts to sweep. If ``None``, a default logarithmic grid is used. Fewer bins give a coarser representation (more stable but less expressive); more bins capture more detail but are noisier with limited training data.
   * - ``distance``
     - ``'topsoe'``
     - Dissimilarity measure between histograms. Options:

       - ``'topsoe'`` (default) — Jensen–Shannon-like, recommended for DyS.
       - ``'hellinger'`` — based on the Euclidean distance between the square roots of the bin frequencies.
       - ``'probsymm'`` — probabilistic symmetric chi-squared.

       The distance choice affects which solver is optimal (see ``solver``).
   * - ``solver``
     - ``'auto'``
     - Optimisation algorithm for the mixture search:

       - ``'auto'`` — chooses ``'ternary'`` for Hellinger/TopSoe/ProbSymm (these have a single minimum), ``'grid'`` otherwise.
       - ``'ternary'`` — faster for unimodal objectives.
       - ``'grid'`` — exhaustive search over a fine grid; finds the best value on the grid regardless of the objective's shape, but is slower.
   * - ``bin_strategy``
     - ``None``
     - How to aggregate results across multiple bin sizes. ``None`` uses a single bin count; ``'median'`` or ``'mean'`` aggregates the estimates obtained for all bin sizes in ``bins_size`` (more robust).
   * - ``laplace_smoothing``
     - ``False``
     - Add Laplace (add-one) smoothing to histogram counts. Prevents zero-bin issues when training data is scarce.
   * - ``cv``
     - ``None``
     - Cross-validation folds for computing training scores. ``None`` uses 5 folds.
   * - ``stratified``
     - ``True``
     - Whether the cross-validation folds are stratified.
   * - ``strategy``
     - ``'ovr'``
     - Multiclass decomposition.

.. figure:: ../images/histogram_matching.png
   :align: center
   :width: 95%
   :alt: DyS histogram matching concept

   *Left: class-conditional score histograms learned from training data. Centre: test score histogram (unlabelled — unknown prevalence). Right: DyS searches for the mixture proportion α that makes α·H⁺ + (1−α)·H⁻ (red step line) match the test histogram (green bars) as closely as possible.*

Examples
--------

Basic binary usage:

.. code-block:: python

   from mlquantify.matching import DyS
   from sklearn.linear_model import LogisticRegression
   from sklearn.datasets import make_classification
   from sklearn.model_selection import train_test_split

   # Imbalanced binary problem: roughly 80% negatives, 20% positives
   X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
   X_train, X_test, y_train, y_test = train_test_split(
       X, y, test_size=0.3, random_state=42)

   q = DyS(estimator=LogisticRegression())
   q.fit(X_train, y_train)
   print(q.predict(X_test))  # e.g. {0: 0.79, 1: 0.21}; exact values depend on the split

Customising bins and distance:

.. code-block:: python

   import numpy as np
   from mlquantify.matching import DyS
   from sklearn.linear_model import LogisticRegression

   # X_train, y_train and X_test come from the basic example above
   q = DyS(
       estimator=LogisticRegression(),
       bins_size=np.arange(2, 32, 2),  # sweep 2,4,6,...,30 bins
       distance='hellinger',
       bin_strategy='median',          # median across all bin sizes
       laplace_smoothing=True,
   )
   q.fit(X_train, y_train)
   print(q.predict(X_test))

Using :meth:`aggregate` with pre-computed scores:

.. code-block:: python

   import numpy as np
   from mlquantify.matching import DyS
   from sklearn.linear_model import LogisticRegression

   # X_train, y_train and X_test come from the basic example above
   clf = LogisticRegression().fit(X_train, y_train)

   # Positive-class scores (column 1)
   train_scores = clf.predict_proba(X_train)[:, 1]
   test_scores = clf.predict_proba(X_test)[:, 1]

   q = DyS(clf)
   q.fit(X_train, y_train)
   print(q.aggregate(test_scores, train_scores, y_train))

----

HDy — Hellinger Distance y-Similarity
---------------------------------------

:class:`HDy` (González-Castro et al., 2013) is a specific instantiation of the DyS framework that sweeps over multiple bin sizes (:math:`10, 20, \ldots, 110` by default) and returns the **median** prevalence across all bin sizes. It uses the Hellinger distance as the dissimilarity.

**Why it exists:** HDy comes from the paper that introduced the idea of comparing score histograms for quantification. The multi-bin median strategy reduces sensitivity to the bin count hyperparameter, making HDy robust without tuning.

Parameters
----------

Same structure as :class:`DyS`. Key defaults:

- ``distance='hellinger'``
- ``bin_strategy='median'``
- ``bins_size=np.linspace(10, 110, 11, dtype=int)``

.. code-block:: python

   from mlquantify.matching import HDy
   from sklearn.linear_model import LogisticRegression

   q = HDy(estimator=LogisticRegression())
   q.fit(X_train, y_train)
   print(q.predict(X_test))

----

HDx — Hellinger Distance x-Similarity (classifier-free)
---------------------------------------------------------

:class:`HDx` (González-Castro et al., 2013) compares class-conditional **feature histograms** directly, without using a classifier. For each feature, it builds a histogram for each class; the mixture of feature histograms that best matches the test histogram gives the prevalence estimate.

**Why it exists:** HDx is a *non-aggregative* histogram method: it does not need a classifier at all. It is useful when no reliable classifier is available, or as a sanity check.
Performance is generally below HDy/DyS (which use a classifier's summary score), but it is a cheap baseline that requires no classifier training.

Parameters
----------

.. list-table::
   :widths: 22 15 63
   :header-rows: 1

   * - Parameter
     - Default
     - Explanation
   * - ``bins_size``
     - ``None``
     - Array of bin counts to sweep (default: ``[2,4,6,...,20,30]``).
   * - ``strategy``
     - ``'ovr'``
     - Multiclass decomposition.

.. code-block:: python

   from mlquantify.matching import HDx

   q = HDx()
   q.fit(X_train, y_train)
   print(q.predict(X_test))  # No classifier needed

----

SMM — Sample Mean Matching
----------------------------

:class:`SMM` (Hassan et al., 2020) is a member of the DyS family that represents each score distribution by a single statistic, its mean, instead of a histogram. Parameters match :class:`DyS`.

----

Density Methods — KDEy
========================

KDE-based methods replace histograms with smooth **kernel density estimates** (KDE), avoiding bin-count sensitivity while still matching distributions.

KDEy-HD, KDEy-CS, KDEy-ML
----------------------------

The three KDEy variants (Moreo et al., 2024) share the same architecture: they build a KDE over classifier posteriors on the training data (for each class) and minimise a dissimilarity to the test KDE at prediction time. They differ in the dissimilarity used:

- :class:`KDEyHD` — Hellinger distance between KDEs.
- :class:`KDEyCS` — Cauchy–Schwarz divergence between KDEs.
- :class:`KDEyML` — maximises the mixture log-likelihood (equivalent to minimising negative log-likelihood).

**Why they exist:** KDEy methods are **natively multiclass** (unlike DyS/HDy) and avoid the histogram bin-count hyperparameter. Moreo et al. (2024) reported state-of-the-art results for multiclass quantification.

Parameters
----------

.. list-table::
   :widths: 22 15 63
   :header-rows: 1

   * - Parameter
     - Default
     - Explanation
   * - ``estimator``
     - ``None``
     - Probabilistic classifier.
   * - ``bandwidth``
     - ``0.1``
     - KDE bandwidth (smoothing). Smaller values give sharper densities (more variance), larger values smooth them out (more bias). Use cross-validation to tune: try ``[0.01, 0.05, 0.1, 0.2, 0.5]``.
   * - ``kernel``
     - ``'gaussian'``
     - KDE kernel type. ``'gaussian'`` works well for probability scores in :math:`[0,1]`. Matches ``sklearn.neighbors.KernelDensity`` options.
   * - ``solver``
     - ``'slsqp'``
     - Optimisation solver for the simplex-constrained mixture problem.
   * - ``cv``
     - ``None``
     - Cross-validation folds.
   * - ``stratified``
     - ``True``
     - Whether the cross-validation folds are stratified.

Examples
--------

Binary with KDEyHD:

.. code-block:: python

   from mlquantify.matching import KDEyHD
   from sklearn.linear_model import LogisticRegression

   q = KDEyHD(estimator=LogisticRegression(), bandwidth=0.1)
   q.fit(X_train, y_train)
   print(q.predict(X_test))

Multiclass with KDEyML (best accuracy):

.. code-block:: python

   from mlquantify.matching import KDEyML
   from sklearn.linear_model import LogisticRegression
   from sklearn.datasets import make_classification

   # Four-class synthetic problem; the first 600 samples are used for training
   X, y = make_classification(n_samples=800, n_classes=4, n_informative=6,
                              n_redundant=0, random_state=42)
   X_train, X_test = X[:600], X[600:]
   y_train = y[:600]

   q = KDEyML(LogisticRegression(), bandwidth=0.05)
   q.fit(X_train, y_train)
   print(q.predict(X_test))

Tuning bandwidth with grid search:

.. code-block:: python

   from mlquantify.model_selection import GridSearchQ, APP
   from mlquantify.matching import KDEyHD
   from mlquantify.metrics import MAE
   from sklearn.linear_model import LogisticRegression

   # Artificial-prevalence protocol used to score each candidate bandwidth
   protocol = APP(batch_size=100, n_prevalences=21, repeats=5)
   gs = GridSearchQ(
       quantifier=KDEyHD(LogisticRegression()),
       param_grid={'bandwidth': [0.01, 0.05, 0.1, 0.2, 0.5]},
       protocol=protocol,
       error=MAE,
   )
   gs.fit(X_train, y_train)
   print(gs.best_params_)

.. tip::

   Use **KDEyML** for multiclass problems: it is consistently the most accurate KDE variant in the benchmark of Moreo et al. (2024). Use **KDEyHD** for binary problems if you want a fast alternative to DyS.
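To make the maximum-likelihood variant concrete, the following is a minimal sketch of the idea behind KDEy-ML in the binary, one-dimensional case. It is not the ``mlquantify`` implementation: the synthetic scores, the helper ``class_density`` and the use of ``scipy.optimize.minimize_scalar`` are assumptions made purely for illustration.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# Synthetic positive-class posteriors standing in for classifier scores.
train_pos = np.clip(rng.normal(0.75, 0.10, 500), 0, 1)
train_neg = np.clip(rng.normal(0.25, 0.10, 500), 0, 1)

# Unlabelled test sample built with a true positive prevalence of 0.3.
test = np.concatenate([
    np.clip(rng.normal(0.75, 0.10, 300), 0, 1),
    np.clip(rng.normal(0.25, 0.10, 700), 0, 1),
])


def class_density(scores, bandwidth=0.1):
    """Fit a Gaussian KDE on the training scores of one class (hypothetical helper)."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    return kde.fit(scores.reshape(-1, 1))


kde_pos = class_density(train_pos)
kde_neg = class_density(train_neg)

# Evaluate both class-conditional densities once at the test points.
dens_pos = np.exp(kde_pos.score_samples(test.reshape(-1, 1)))
dens_neg = np.exp(kde_neg.score_samples(test.reshape(-1, 1)))


def neg_log_likelihood(alpha, eps=1e-12):
    """Negative log-likelihood of the test scores under the alpha-mixture."""
    mixture = alpha * dens_pos + (1.0 - alpha) * dens_neg
    return -np.sum(np.log(mixture + eps))


# In the binary case the simplex constraint reduces to alpha in [0, 1].
result = minimize_scalar(neg_log_likelihood, bounds=(0.0, 1.0), method="bounded")
alpha_hat = float(result.x)
print(round(alpha_hat, 3))
```

In the multiclass setting the scalar ``alpha`` becomes a vector on the probability simplex and the densities are estimated over the full posterior vectors, which is why a simplex-constrained solver such as SLSQP is needed there.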
----

Kernel Methods — MMD
======================

:class:`MMD_RKHS` (Maximum Mean Discrepancy in Reproducing Kernel Hilbert Space) matches class-conditional **kernel mean embeddings** of the raw feature vectors, rather than classifier scores. It directly computes the kernel mean of each class in feature space and finds the mixture that minimises the MMD to the test kernel mean.

**Why it exists:** MMD_RKHS is a **non-aggregative** method (no classifier needed) that works directly on the feature space via the kernel trick. It is useful when features are naturally compared with a kernel (e.g. strings, graphs, or dense embeddings). Iyer et al. (2014) showed strong convergence guarantees for kernel quantification.

Parameters
----------

.. list-table::
   :widths: 22 15 63
   :header-rows: 1

   * - Parameter
     - Default
     - Explanation
   * - ``kernel``
     - ``'rbf'``
     - Kernel function for computing the RKHS mean embedding. Options: ``'rbf'``, ``'linear'``, ``'poly'``, ``'sigmoid'``, ``'cosine'``. ``'rbf'`` works well for continuous features.
   * - ``gamma``
     - ``None``
     - Kernel coefficient for the RBF, polynomial and sigmoid kernels. ``None`` uses ``1/n_features``. Tune this if features are on very different scales (consider normalising first).
   * - ``degree``
     - ``3``
     - Polynomial degree for ``'poly'`` kernel.
   * - ``coef0``
     - ``0.0``
     - Independent term for poly/sigmoid kernels.
   * - ``solver``
     - ``'slsqp'``
     - Optimisation solver.

.. code-block:: python

   from mlquantify.matching import MMD_RKHS

   q = MMD_RKHS(kernel='rbf', gamma=0.1)
   q.fit(X_train, y_train)  # No classifier — works on raw features
   print(q.predict(X_test))

----

Score Methods — SORD
======================

:class:`SORD` (Sample Ordinal Distance) estimates prevalence by comparing the *ranked order* of classifier scores between the test set and the mixture of class-conditional training scores, using an earth-mover-style distance.
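The sample-based matching described above can be illustrated with a minimal sketch. It is an assumption-laden stand-in rather than the ``SORD`` implementation: it uses ``scipy.stats.wasserstein_distance`` as the earth-mover-style distance, synthetic scores, a plain grid over the mixture weight, and a hypothetical helper named ``mixture_distance``.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Synthetic training scores for each class (stand-ins for classifier outputs).
train_pos = rng.normal(0.7, 0.12, 400)
train_neg = rng.normal(0.3, 0.12, 400)
train_scores = np.concatenate([train_pos, train_neg])

# Unlabelled test scores built with a true positive prevalence of 0.3.
test = np.concatenate([rng.normal(0.7, 0.12, 120), rng.normal(0.3, 0.12, 280)])


def mixture_distance(alpha):
    """Earth-mover distance between the test scores and the alpha-weighted training mixture."""
    # Positives share a total mass of alpha, negatives a total mass of 1 - alpha.
    weights = np.concatenate([
        np.full(len(train_pos), alpha / len(train_pos)),
        np.full(len(train_neg), (1.0 - alpha) / len(train_neg)),
    ])
    return wasserstein_distance(test, train_scores, v_weights=weights)


# Exhaustive search over candidate prevalences; no bins or bandwidth involved.
alphas = np.linspace(0.0, 1.0, 101)
distances = [mixture_distance(a) for a in alphas]
alpha_hat = float(alphas[int(np.argmin(distances))])
print(alpha_hat)
```

Because the distance is computed on the raw samples, the only free choice in this sketch is the resolution of the search grid.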
**Why it exists:** SORD operates on the continuous score values without discretisation (no bins needed) and without density estimation. It is fast, parameter-free (beyond the classifier), and competitive with histogram methods.

.. code-block:: python

   from mlquantify.matching import SORD
   from sklearn.linear_model import LogisticRegression

   q = SORD(estimator=LogisticRegression())
   q.fit(X_train, y_train)
   print(q.predict(X_test))

----

Choosing a Distribution Matching Method
=========================================

.. list-table::
   :widths: 12 12 12 18 46
   :header-rows: 1

   * - Method
     - Multiclass
     - Needs classifier
     - Key hyperparameter
     - Best for
   * - DyS
     - ✗ (OvR)
     - ✓
     - ``bins_size``, ``distance``
     - Binary; strong with median-sweep bins.
   * - HDy
     - ✗ (OvR)
     - ✓
     - ``bins_size``
     - Binary; tuning-free median-sweep baseline.
   * - HDx
     - ✗ (OvR)
     - ✗
     - ``bins_size``
     - No classifier available; sanity check.
   * - KDEyHD
     - ✓
     - ✓
     - ``bandwidth``
     - Binary & multiclass; smooth density matching.
   * - KDEyML
     - ✓
     - ✓
     - ``bandwidth``
     - **Multiclass; best overall accuracy.**
   * - MMD_RKHS
     - ✓
     - ✗
     - ``kernel``, ``gamma``
     - Kernel-based features; no classifier needed.
   * - SORD
     - ✗ (OvR)
     - ✓
     - None
     - Binary; parameter-free, fast.

**Practical recommendation:**

- For **binary** problems: **DyS** or **MS** (threshold-adjustment) are strong.
- For **multiclass** problems: **KDEyML** is the recommended starting point.
- When no classifier is available: **HDx** or **MMD_RKHS**.
- For a parameter-free binary option: **SORD**.

.. seealso::

   :ref:`likelihood` for EMQ, which is often as good as or better than DM methods under pure prior probability shift.

Distribution Matching (DM) methods estimate prevalences by matching the test distribution to a mixture of class-conditional distributions learned on the training data. In practice, the matching strategy depends on how distributions are represented.
The matching module is organized around four representation families:

- **Histogram:** histogram-based matching (DyS, HDy, HDx, SMM).
- **Density:** KDE-based matching over the probability simplex (KDEy variants).
- **Kernel:** kernel mean matching in RKHS (MMD_RKHS).
- **Scores:** matching directly on score samples (SORD).

.. dropdown:: Mathematical details - Mixture Formulation

   The observed distribution in the test set is approximated as:

   .. math::

      D_U \approx \hat{p} \cdot D_+ + (1 - \hat{p}) \cdot D_-

   DM methods search for the mixture parameter :math:`\hat{p}` that minimizes a chosen dissimilarity between the test distribution and the mixture.

.. dropdown:: References

   .. [1] Forman, G. (2008). Quantifying counts and costs via classification. Data Mining and Knowledge Discovery, 17(2), 164-206. https://doi.org/10.1007/s10618-008-0097-y

Histogram
=========

Histogram-based DM builds class-conditional histograms of posterior scores and fits the test histogram as a mixture of those class histograms. These methods are **binary-first** and default to one-vs-rest for multiclass settings.

DyS: Distribution y-Similarity Framework
----------------------------------------

**DyS** is a generic framework that formalizes histogram-based matching. It selects the prevalence :math:`\alpha` that minimizes a dissimilarity between the test score histogram and the mixture of training histograms [2]_.

.. dropdown:: Mathematical details - DyS Optimization

   .. math::

      \hat{p}^{DyS}(\oplus) = \alpha^* = \operatorname*{arg\,min}_{0 \le \alpha \le 1} \{ DS(\alpha f_{L^{\oplus}} + (1-\alpha) f_{L^{\ominus}}, f_U) \}

HDy: Hellinger Distance y-Similarity
------------------------------------

**HDy** is a popular instance of DyS that uses the Hellinger distance over histograms of posterior probabilities.

.. code-block:: python

   from mlquantify.matching import HDy
   from sklearn.ensemble import RandomForestClassifier

   # X_train, y_train and X_test are assumed to be defined as in the earlier examples.
   # ``bins_size`` is the bin-count parameter documented for the DyS family.
   q = HDy(estimator=RandomForestClassifier(), bins_size=10)
   q.fit(X_train, y_train)
   print(q.predict(X_test))

.. dropdown:: Mathematical details - HDy Bin Adjustment

   .. math::

      \frac{|D'_i|}{|D'|} = \frac{|D^+_i|}{|D^+|} \cdot \hat{p} + \frac{|D^-_i|}{|D^-|} \cdot (1 - \hat{p})

SMM: Sample Mean Matching
-------------------------

**SMM** replaces histograms with a single statistic: the mean score. It solves the mixture matching problem in closed form and is equivalent to PACC [4]_.

.. dropdown:: Mathematical details - SMM Closed Form

   .. math::

      \alpha = \frac{\mu[S_U] - \mu[S_{\ominus}]}{\mu[S_{\oplus}] - \mu[S_{\ominus}]}

.. plot::
   :align: center
   :caption: Histogram mixtures used by DyS/HDy-like methods.

   import numpy as np
   import matplotlib.pyplot as plt

   rng = np.random.default_rng(2)
   pos = rng.normal(0.7, 0.1, 800)
   neg = rng.normal(0.3, 0.1, 800)
   mix = np.concatenate([pos[:600], neg[:400]])
   bins = np.linspace(0, 1, 21)

   plt.hist(pos, bins=bins, alpha=0.5, label="positive")
   plt.hist(neg, bins=bins, alpha=0.5, label="negative")
   plt.hist(mix, bins=bins, histtype="step", linewidth=2, label="test")
   plt.xlim(0, 1)
   plt.legend()

.. dropdown:: References

   .. [2] Maletzke, A., dos Reis, D., Cherman, E., & Batista, G. (2019). DyS: A Framework for Mixture Models in Quantification. AAAI.
   .. [3] González-Castro, V., Alaiz-Rodríguez, R., & Alegre, E. (2013). Class distribution estimation based on the Hellinger distance. Information Sciences, 218, 146-164. https://doi.org/10.1016/j.ins.2012.05.028
   .. [4] Hassan, W., Maletzke, A., & Batista, G. (2020). Accurately quantifying a billion instances per second. IEEE DSAA.
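The histogram mixture search and the SMM closed form described in this section can be reproduced with NumPy alone. The sketch below is illustrative and is not the library's code: the synthetic scores, the fixed ten bins, the grid search, and the helpers ``normalised_hist`` and ``hellinger`` are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic class-conditional training scores, clipped to the [0, 1] score range.
pos = np.clip(rng.normal(0.7, 0.1, 800), 0, 1)
neg = np.clip(rng.normal(0.3, 0.1, 800), 0, 1)

# Unlabelled test scores built with a true positive prevalence of 0.6.
test = np.concatenate([
    np.clip(rng.normal(0.7, 0.1, 600), 0, 1),
    np.clip(rng.normal(0.3, 0.1, 400), 0, 1),
])


def normalised_hist(scores, n_bins=10):
    """Relative bin frequencies of the scores over [0, 1]."""
    counts, _ = np.histogram(scores, bins=n_bins, range=(0.0, 1.0))
    return counts / counts.sum()


def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


h_pos, h_neg, h_test = (normalised_hist(s) for s in (pos, neg, test))

# DyS/HDy-style search: pick the mixture weight whose histogram is closest to the test one.
alphas = np.linspace(0.0, 1.0, 101)
dists = [hellinger(a * h_pos + (1.0 - a) * h_neg, h_test) for a in alphas]
alpha_dys = float(alphas[int(np.argmin(dists))])

# SMM closed form: match the mean score instead of the whole histogram.
alpha_smm = (test.mean() - neg.mean()) / (pos.mean() - neg.mean())
alpha_smm = float(np.clip(alpha_smm, 0.0, 1.0))

print(alpha_dys, round(alpha_smm, 3))
```

Both estimates should land close to the prevalence used to build the test sample; the histogram search additionally depends on the bin count, which is the sensitivity that median sweeps over several bin sizes are meant to reduce.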
Density
=======

KDEy: Kernel Density Estimation y-Similarity
--------------------------------------------

**KDEy** is a multi-class DM approach that replaces histograms with continuous densities over the probability simplex, allowing it to model inter-class interactions and avoid binning artifacts [5]_.

.. figure:: ../images/kdey-concept.png
   :align: center
   :width: 80%
   :alt: KDEy Concept Illustration

   *Illustration of KDEy modeling class-conditional densities on the probability simplex.*

KDEy-ML (Maximum Likelihood)
----------------------------

The :class:`KDEyML` class maximizes the likelihood of the test scores under the mixture of KDE class-conditional densities.

.. dropdown:: Mathematical details - KDEy-ML Optimization

   .. math::

      \hat{\alpha} = \operatorname*{arg\,min}_{\alpha \in \Delta^{n-1}} \left( - \sum_{x \in U} \log \left( \sum_{i=1}^{n} \alpha_i \cdot p_{\tilde{L}_i}(x) \right) \right)

KDEy-HD (Hellinger Distance)
----------------------------

The :class:`KDEyHD` class minimizes the Hellinger distance between the test KDE and the mixture of class-conditional KDEs using Monte Carlo approximation.

KDEy-CS (Cauchy-Schwarz)
------------------------

The :class:`KDEyCS` class minimizes the Cauchy-Schwarz divergence with a closed form that leverages kernel Gram matrices.

.. plot::
   :align: center
   :caption: KDE-based density matching over the simplex (illustrative).

   import numpy as np
   import matplotlib.pyplot as plt

   x = np.linspace(0.01, 0.99, 200)
   pos = np.exp(-0.5 * ((x - 0.75) / 0.08) ** 2)
   neg = np.exp(-0.5 * ((x - 0.25) / 0.08) ** 2)
   mix = 0.6 * pos + 0.4 * neg

   plt.plot(x, pos, label="positive KDE")
   plt.plot(x, neg, label="negative KDE")
   plt.plot(x, mix, linestyle="--", label="mixture")
   plt.legend()

Kernel
======

Kernel matching minimizes the distance between the kernel mean embedding of the test sample and the mixture of class-conditional kernel mean embeddings.
The :class:`MatchingKernelQuantifier` base class implements this strategy and the :class:`MMD_RKHS` quantifier provides the standard RKHS formulation [6]_.

.. plot::
   :align: center
   :caption: Kernel similarities used for mean matching.

   import numpy as np
   import matplotlib.pyplot as plt

   x = np.linspace(-2, 2, 200)
   gamma = 1.5
   k_rbf = np.exp(-gamma * (x ** 2))

   plt.plot(x, k_rbf, label="rbf kernel")
   plt.axhline(0, color="0.8", linewidth=1)
   plt.legend()

.. dropdown:: References

   .. [6] Zhang, K., Schölkopf, B., Muandet, K., & Wang, Z. (2013). Domain Adaptation under Target and Conditional Shift. ICML.

Scores
======

Score-based matching works directly on the score samples rather than binned histograms. The :class:`SORD` quantifier minimizes a cumulative distance between the test score distribution and the weighted mixture of train scores.

.. plot::
   :align: center
   :caption: Sample-based matching with cumulative score distances.

   import numpy as np
   import matplotlib.pyplot as plt

   rng = np.random.default_rng(0)
   pos = np.sort(rng.normal(0.7, 0.12, 200))
   neg = np.sort(rng.normal(0.3, 0.12, 200))
   test = np.sort(np.concatenate([pos[:120], neg[:80]]))

   y_pos = np.linspace(0, 1, len(pos))
   y_neg = np.linspace(0, 1, len(neg))
   y_test = np.linspace(0, 1, len(test))

   plt.plot(pos, y_pos, label="positive CDF")
   plt.plot(neg, y_neg, label="negative CDF")
   plt.plot(test, y_test, linestyle="--", label="test CDF")
   plt.legend()

.. dropdown:: References

   .. [5] Moreo, A., González, P., & del Coz, J. J. (2024). Kernel Density Estimation for Multiclass Quantification. http://arxiv.org/abs/2401.00490
   .. [7] Maletzke, A., dos Reis, D., Hassan, W., & Batista, G. (2021). Accurately Quantifying under Score Variability.