2.7. Distribution Matching#

Distribution matching (DM) methods estimate prevalences without inverting a confusion matrix or re-weighting posteriors. Instead, they find the mixture proportion of class-conditional distributions that best reproduces the observed test distribution. This makes them highly versatile: they can handle non-standard classifiers, non-linear shift, and multiclass problems natively.

Core idea: During training, DM methods learn a representation \(r_c\) of each class’s distribution (a histogram, density, kernel mean, …). At test time, they find the prevalence vector \(\hat{p}\) such that the mixture

\[\sum_c \hat{p}(c) \cdot r_c \;\approx\; r_U\]

is as close as possible to the test representation \(r_U\), under some dissimilarity measure. This is solved as a constrained optimisation over the probability simplex.

mlquantify organises distribution matching into four representation families — histogram, density (KDE), kernel, and scores — all exposed through mlquantify.matching.

2.7.1. Histogram Methods #

Histogram methods discretise the classifier’s score distribution into bins. They are the oldest DM family and remain competitive on binary problems.

Binary-only (unless noted)

Histogram methods are fundamentally binary. mlquantify applies OvR decomposition automatically for multiclass datasets.

2.7.1.1. DyS — Distribution y-Similarity #

DyS (Maletzke et al., 2019) builds a histogram of classifier scores for each class from cross-validated predictions. At test time it searches for the mixture proportion \(\alpha \in [0,1]\) that minimises a dissimilarity measure between the test score histogram and the mixture of class histograms:

\[\hat{p}^{DyS}(\oplus) = \arg\min_{\alpha \in [0,1]} D\!\left( \alpha \cdot H_+ + (1-\alpha) \cdot H_- ,\; H_U \right)\]

Why it excels: DyS is a framework that separates the representation (histogram), the dissimilarity measure, and the solver. It can be configured with any distance and bin size. Maletzke et al. (2019) showed it beats threshold-adjustment methods and matches EMQ on many benchmarks.

2.7.1.2. Parameters #

Parameter	Default	Explanation
`estimator`	`None`	Probabilistic classifier (`predict_proba`). Its cross-validated scores are used to build the class histograms.
`bins_size`	`None`	Number of histogram bins, or an array of bin counts to sweep. If `None`, a default logarithmic grid is used. Larger bins give a coarser representation (more stable but less expressive); smaller bins capture more detail but are noisier with limited training data.
`distance`	`'topsoe'`	Dissimilarity measure between histograms. Options: `'topsoe'` (default) — Jensen–Shannon-like, recommended for DyS. `'hellinger'` — square root of half the chi-squared distance. `'probsymm'` — probabilistic symmetric chi-squared. The distance choice affects which solver is optimal (see `solver`).
`solver`	`'auto'`	Optimisation algorithm for the mixture search: `'auto'` — chooses `'ternary'` for Hellinger/TopSoe/ProbSymm (these have a single minimum), `'grid'` otherwise. `'ternary'` — faster for unimodal objectives. `'grid'` — exhaustive search over a fine grid; always finds the global minimum but slower.
`bin_strategy`	`None`	How to aggregate results across multiple bin sizes. `None` uses a single bin count; `'median'` or `'mean'` averages across all bin sizes in `bins_size` (more robust).
`laplace_smoothing`	`False`	Add Laplace (add-one) smoothing to histogram counts. Prevents zero-bin issues when training data is scarce.
`cv`	`None`	Cross-validation folds for computing training scores. `None` uses 5.
`stratified`	`True`	Stratified folds.
`strategy`	`'ovr'`	Multiclass decomposition.

DyS histogram matching concept — Left: class-conditional score histograms learned from training data. Centre: test score histogram (unlabelled — unknown prevalence). Right: DyS searches for the mixture proportion α that makes α·H⁺ + (1−α)·H⁻ (red step line) match the test histogram (green bars) as closely as possible.#

2.7.1.3. Examples #

Basic binary usage:

from mlquantify.matching import DyS
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

q = DyS(estimator=LogisticRegression())
q.fit(X_train, y_train)
print(q.predict(X_test))
# {0: 0.79, 1: 0.21}

Customising bins and distance:

import numpy as np
from mlquantify.matching import DyS
from sklearn.linear_model import LogisticRegression

q = DyS(
    estimator=LogisticRegression(),
    bins_size=np.arange(2, 32, 2),   # sweep 2,4,6,...,30 bins
    distance='hellinger',
    bin_strategy='median',           # median across all bin sizes
    laplace_smoothing=True,
)
q.fit(X_train, y_train)
print(q.predict(X_test))

Using aggregate with pre-computed scores:

import numpy as np
from mlquantify.matching import DyS
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(X_train, y_train)
# Positive-class scores (column 1)
train_scores = clf.predict_proba(X_train)[:, 1]
test_scores  = clf.predict_proba(X_test)[:, 1]

q = DyS(clf)
q.fit(X_train, y_train)
print(q.aggregate(test_scores, train_scores, y_train))

2.7.1.4. HDy — Hellinger Distance y-Similarity #

HDy (González-Castro et al., 2013) is a specific instantiation of the DyS framework that sweeps over multiple bin sizes (\(10, 20, \ldots, 110\) by default) and returns the median prevalence across all bin sizes. It uses the Hellinger distance as the dissimilarity.

Why it exists: HDy was the original paper that introduced the idea of comparing score histograms for quantification. The multi-bin median strategy reduces sensitivity to the bin count hyperparameter, making HDy robust without tuning.

2.7.1.5. Parameters #

Same structure as DyS. Key defaults:

distance='hellinger'
bin_strategy='median'
bins_size=np.linspace(10, 110, 11, dtype=int)

from mlquantify.matching import HDy
from sklearn.linear_model import LogisticRegression

q = HDy(estimator=LogisticRegression())
q.fit(X_train, y_train)
print(q.predict(X_test))

2.7.1.6. HDx — Hellinger Distance x-Similarity (classifier-free)#

HDx (González-Castro et al., 2013) compares class-conditional feature histograms directly, without using a classifier. For each feature, it builds a histogram for each class; the mixture of feature histograms that best matches the test histogram gives the prevalence estimate.

Why it exists: HDx is a non-aggregative histogram method — it does not need a classifier at all. It is useful when no reliable classifier is available, or as a sanity check. Performance is generally below HDy/DyS (which use a classifier’s summary score), but it is a zero-cost baseline.

2.7.1.7. Parameters #

Parameter	Default	Explanation
`bins_size`	`None`	Array of bin counts to sweep (default: `[2,4,6,...,20,30]`).
`strategy`	`'ovr'`	Multiclass decomposition.

from mlquantify.matching import HDx

q = HDx()
q.fit(X_train, y_train)
print(q.predict(X_test))
# No classifier needed

2.7.1.8. SMM — Score Mixture Model #

SMM extends DyS to use a more flexible representation of the score distribution. Parameters match DyS.

2.7.2. Density Methods — KDEy #

KDE-based methods replace histograms with smooth kernel density estimates (KDE), avoiding bin-count sensitivity while still matching distributions.

2.7.2.1. KDEy-HD, KDEy-CS, KDEy-ML #

The three KDEy variants (Moreo et al., 2024) share the same architecture: they build a KDE over classifier posteriors on the training data (for each class) and minimise a dissimilarity to the test KDE at prediction time. They differ in the dissimilarity used:

KDEyHD — Hellinger distance between KDEs.
KDEyCS — squared cosine distance between KDEs.
KDEyML — maximises the mixture log-likelihood (equivalent to minimising negative log-likelihood).

Why they exist: KDEy methods are natively multiclass (unlike DyS/HDy) and avoid the histogram bin-count hyperparameter. Moreo et al. (2024) showed they are state-of-the-art for multiclass quantification.

2.7.2.2. Parameters #

Parameter	Default	Explanation
`estimator`	`None`	Probabilistic classifier.
`bandwidth`	`0.1`	KDE bandwidth (smoothing). Smaller values give sharper densities (more variance), larger values smooth them out (more bias). Use cross-validation to tune: try `[0.01, 0.05, 0.1, 0.2, 0.5]`.
`kernel`	`'gaussian'`	KDE kernel type. `'gaussian'` works well for probability scores in \([0,1]\). Matches `sklearn.neighbors.KernelDensity` options.
`solver`	`'slsqp'`	Optimisation solver for the simplex-constrained mixture problem.
`cv`	`None`	Cross-validation folds.
`stratified`	`True`	Stratified folds.

2.7.2.3. Examples #

Binary with KDEyHD:

from mlquantify.matching import KDEyHD
from sklearn.linear_model import LogisticRegression

q = KDEyHD(estimator=LogisticRegression(), bandwidth=0.1)
q.fit(X_train, y_train)
print(q.predict(X_test))

Multiclass with KDEyML (best accuracy):

from mlquantify.matching import KDEyML
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=800, n_classes=4,
                           n_informative=6, n_redundant=0,
                           random_state=42)
X_train, X_test = X[:600], X[600:]
y_train = y[:600]

q = KDEyML(LogisticRegression(), bandwidth=0.05)
q.fit(X_train, y_train)
print(q.predict(X_test))

Tuning bandwidth with grid search:

from mlquantify.model_selection import GridSearchQ
from mlquantify.matching import KDEyHD
from mlquantify.model_selection import APP
from mlquantify.metrics import MAE
from sklearn.linear_model import LogisticRegression

protocol = APP(batch_size=100, n_prevalences=21, repeats=5)
gs = GridSearchQ(
    quantifier=KDEyHD(LogisticRegression()),
    param_grid={'bandwidth': [0.01, 0.05, 0.1, 0.2, 0.5]},
    protocol=protocol,
    error=MAE,
)
gs.fit(X_train, y_train)
print(gs.best_params_)

Tip

Use KDEyML for multiclass problems — it is consistently the most accurate KDE variant in Moreo et al. (2024)’s benchmark. Use KDEyHD for binary problems if you want a fast alternative to DyS.

2.7.3. Kernel Methods — MMD #

MMD_RKHS (Maximum Mean Discrepancy in Reproducing Kernel Hilbert Space) matches class-conditional kernel mean embeddings of the raw feature vectors, rather than classifier scores. It directly computes the kernel mean of each class in feature space and finds the mixture that minimises the MMD to the test kernel mean.

Why it exists: MMD_RKHS is a non-aggregative method (no classifier needed) that works directly on the feature space via a kernel trick. It is useful when features are naturally compared with a kernel (e.g. strings, graphs, or dense embeddings). Iyer et al. (2014) showed strong convergence guarantees for kernel quantification.

2.7.3.1. Parameters #

Parameter	Default	Explanation
`kernel`	`'rbf'`	Kernel function for computing the RKHS mean embedding. Options: `'rbf'`, `'linear'`, `'poly'`, `'sigmoid'`, `'cosine'`. `'rbf'` works well for continuous features.
`gamma`	`None`	RBF/poly/sigmoid bandwidth. `None` uses `1/n_features`. Tune this if features are on very different scales (consider normalising first).
`degree`	`3`	Polynomial degree for `'poly'` kernel.
`coef0`	`0.0`	Independent term for poly/sigmoid kernels.
`solver`	`'slsqp'`	Optimisation solver.

from mlquantify.matching import MMD_RKHS

q = MMD_RKHS(kernel='rbf', gamma=0.1)
q.fit(X_train, y_train)      # No classifier — works on raw features
print(q.predict(X_test))

2.7.4. Score Methods — SORD #

SORD (Score-based Optimal Ranking Distribution) estimates prevalence by comparing the ranked order of classifier scores between the test set and the mixture of class-conditional training scores, using an earth-mover-style distance.

Why it exists: SORD operates on the continuous score values without discretisation (no bins needed) and without density estimation. It is fast, parameter-free (beyond the classifier), and competitive with histogram methods.

from mlquantify.matching import SORD
from sklearn.linear_model import LogisticRegression

q = SORD(estimator=LogisticRegression())
q.fit(X_train, y_train)
print(q.predict(X_test))

2.7.5. Choosing a Distribution Matching Method #

Method	Multiclass	Needs clf	Key hyperparameter	Best for
DyS	✗ (OvR)	✓	`bins_size`, `distance`	Binary; strong with median-sweep bins.
HDy	✗ (OvR)	✓	`bins_size`	Binary; tuning-free median-sweep baseline.
HDx	✗ (OvR)	✗	`bins_size`	No classifier available; sanity check.
KDEyHD	✓	✓	`bandwidth`	Binary & multiclass; smooth density matching.
KDEyML	✓	✓	`bandwidth`	Multiclass; best overall accuracy.
MMD_RKHS	✓	✗	`kernel`, `gamma`	Kernel-based features; no classifier needed.
SORD	✗ (OvR)	✓	None	Binary; parameter-free, fast.

Practical recommendation:

For binary problems: DyS or MS (threshold-adjustment) are strong.
For multiclass problems: KDEyML is the recommended starting point.
When no classifier is available: HDx or MMD_RKHS.
For a parameter-free binary option: SORD.

2.7.6. Histogram #

Histogram-based DM builds class-conditional histograms of posterior scores and fits the test histogram as a mixture of those class histograms. These methods are binary-first and default to one-vs-rest for multiclass settings.

2.7.6.1. DyS: Distribution y-Similarity Framework #

DyS is a generic framework that formalizes histogram-based matching. It selects the prevalence \(\alpha\) that minimizes a dissimilarity between the test score histogram and the mixture of training histograms [2].

2.7.6.2. HDy: Hellinger Distance y-Similarity #

HDy is a popular instance of DyS that uses the Hellinger distance over histograms of posterior probabilities.

from mlquantify.matching import HDy
from sklearn.ensemble import RandomForestClassifier

q = HDy(estimator=RandomForestClassifier(), bins=10)
q.fit(X_train, y_train)
q.predict(X_test)

2.7.6.3. SMM: Sample Mean Matching #

SMM replaces histograms with a single statistic: the mean score. It solves the mixture matching problem in closed form and is equivalent to PACC [4].

(Source code, png, hires.png, pdf)

../_images/distribution_matching-1.png — Histogram mixtures used by DyS/HDy-like methods.#

2.7.7. Density #

2.7.7.1. KDEy: Kernel Density Estimation y-Similarity #

KDEy is a multi-class DM approach that replaces histograms with continuous densities over the probability simplex, allowing it to model inter-class interactions and avoid binning artifacts [5].

KDEy Concept Illustration — *Illustration of KDEy modeling class-conditional densities on the probability simplex.*#

2.7.7.2. KDEy-ML (Maximum Likelihood)#

The KDEyML class maximizes the likelihood of the test scores under the mixture of KDE class-conditional densities.

2.7.7.3. KDEy-HD (Hellinger Distance)#

The KDEyHD class minimizes the Hellinger distance between the test KDE and the mixture of class-conditional KDEs using Monte Carlo approximation.

2.7.7.4. KDEy-CS (Cauchy-Schwarz)#

The KDEyCS class minimizes the Cauchy-Schwarz divergence with a closed form that leverages kernel Gram matrices.

(Source code, png, hires.png, pdf)

../_images/distribution_matching-2.png — KDE-based density matching over the simplex (illustrative).#

2.7.8. Kernel #

Kernel matching minimizes the distance between the kernel mean embedding of the test sample and the mixture of class-conditional kernel mean embeddings. The MatchingKernelQuantifier base class implements this strategy and the MMD_RKHS quantifier provides the standard RKHS formulation [6].

(Source code, png, hires.png, pdf)

../_images/distribution_matching-3.png — Kernel similarities used for mean matching.#

2.7.9. Scores #

Score-based matching works directly on the score samples rather than binned histograms. The SORD quantifier minimizes a cumulative distance between the test score distribution and the weighted mixture of train scores.

(Source code, png, hires.png, pdf)

../_images/distribution_matching-4.png — Sample-based matching with cumulative score distances.#