.. _sphx_distribution_matching:

===================================
Distribution matching, step by step
===================================

Distribution-matching quantifiers such as
:class:`~mlquantify.matching.HDy` and :class:`~mlquantify.matching.DyS`
estimate prevalence from a simple idea: if the class-conditional score
distributions are the same at training and test time, the histogram of
classifier scores on the test set should approximately equal the
prevalence-weighted mixture of the *class-conditional* score histograms
learned on the training set. The quantifier searches for the positive-class
prevalence :math:`\alpha` whose mixture best matches the observed test
histogram.

This example reconstructs that mechanism with plain matplotlib so you can
see exactly what is being matched. We build the two class-conditional
histograms, the observed test histogram, and the best-fit mixture found by
sweeping :math:`\alpha`.

.. plot::

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X, y = make_classification(
        n_samples=6000, n_features=20, weights=[0.5, 0.5], random_state=0,
    )
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, random_state=0,
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    # Ten equal-width bins over the score range [0, 1].
    bins = np.linspace(0, 1, 11)

    def hist(scores):
        """Return the normalised histogram (bin frequencies summing to 1)."""
        h, _ = np.histogram(scores, bins=bins, density=False)
        return h / h.sum()

    # Class-conditional histograms, built from (in-sample) training scores.
    s_tr = clf.predict_proba(X_tr)[:, 1]
    h_neg = hist(s_tr[y_tr == 0])
    h_pos = hist(s_tr[y_tr == 1])

    # Build a test sample with ~70% positives and histogram its scores.
    pos = np.where(y_te == 1)[0]
    neg = np.where(y_te == 0)[0]
    true_prev = 0.70
    n = 800
    n_pos = int(true_prev * n)
    idx = np.concatenate([
        rng.choice(pos, n_pos, replace=True),
        rng.choice(neg, n - n_pos, replace=True),
    ])
    h_test = hist(clf.predict_proba(X_te[idx])[:, 1])

    # Sweep alpha and pick the mixture closest to the test histogram
    # (squared L2 distance between the two histograms).
    alphas = np.linspace(0, 1, 201)
    losses = [np.sum(((1 - a) * h_neg + a * h_pos - h_test) ** 2)
              for a in alphas]
    a_hat = alphas[int(np.argmin(losses))]
    mixture = (1 - a_hat) * h_neg + a_hat * h_pos

    centers = (bins[:-1] + bins[1:]) / 2
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4.5))

    # Left panel: test histogram, best-fit mixture, and its two components.
    width = (bins[1] - bins[0]) * 0.9
    ax1.bar(centers, h_test, width=width, color="#bdbdbd",
            label="test histogram")
    ax1.step(centers, mixture, where="mid", color="#e63946", lw=2,
             label=f"best-fit mixture ($\\hat\\alpha$={a_hat:.2f})")
    ax1.step(centers, h_pos, where="mid", color="#457b9d", ls="--",
             label="class-conditional: positive")
    ax1.step(centers, h_neg, where="mid", color="#2a9d8f", ls="--",
             label="class-conditional: negative")
    ax1.set_xlabel("Classifier score P(y=1)")
    ax1.set_ylabel("Frequency")
    ax1.set_title("What is being matched")
    ax1.legend(fontsize="small")

    # Right panel: the loss curve over candidate prevalences.
    ax2.plot(alphas, losses, color="#264653")
    ax2.axvline(a_hat, color="#e63946", ls=":",
                label=f"$\\hat\\alpha$={a_hat:.2f}")
    ax2.axvline(true_prev, color="k", ls="--", lw=1,
                label=f"true={true_prev:.2f}")
    ax2.set_xlabel(r"Candidate positive prevalence $\alpha$")
    ax2.set_ylabel("Histogram mismatch (squared L2)")
    ax2.set_title("The objective being minimised")
    ax2.legend()

    fig.tight_layout()

The left panel shows the test histogram with the matched mixture sitting
almost on top of it; the right panel shows the matching loss as a function
of :math:`\alpha`, with its minimum landing close to the true prevalence.
Real methods differ mainly in the *distance* they minimise (Hellinger for
HDy, Topsøe or a tunable mixture for DyS) and in how finely they search.

.. seealso::

   - :class:`~mlquantify.matching.HDy`, :class:`~mlquantify.matching.DyS`,
     :class:`~mlquantify.matching.SORD`: ready-made implementations.
   - :ref:`sphx_method_comparison`: how DyS compares against other families.
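
To make the remark about distances concrete, the sweep over :math:`\alpha` can be written so that the distance is a plug-in argument. The following is a minimal sketch, not the library's implementation: ``hellinger`` and ``match_prevalence`` are hypothetical helper names introduced here, the five-bin histograms are made-up numbers, and the actual :class:`~mlquantify.matching.HDy` may differ in binning, normalisation, and search strategy.

```python
import numpy as np


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (each sums to 1)."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


def match_prevalence(h_neg, h_pos, h_test, distance, n_grid=201):
    """Grid-search the alpha whose mixture is closest to h_test under `distance`.

    Hypothetical helper for illustration; returns (best alpha, all losses).
    """
    alphas = np.linspace(0.0, 1.0, n_grid)
    losses = np.array(
        [distance((1 - a) * h_neg + a * h_pos, h_test) for a in alphas]
    )
    return alphas[int(np.argmin(losses))], losses


# Toy class-conditional histograms (made-up values, illustration only).
h_neg = np.array([0.50, 0.30, 0.12, 0.06, 0.02])
h_pos = np.array([0.02, 0.08, 0.15, 0.30, 0.45])

# A noise-free "test" histogram constructed as an exact 70/30 mixture.
true_alpha = 0.70
h_test = (1 - true_alpha) * h_neg + true_alpha * h_pos

a_hat, losses = match_prevalence(h_neg, h_pos, h_test, hellinger)
print(f"estimated prevalence: {a_hat:.3f}")
```

Because the toy test histogram is an exact mixture, the sweep recovers the prevalence used to build it. On real scores the test histogram is noisy, so different distances can place the minimum at slightly different values of :math:`\alpha`. Passing a different callable as ``distance`` is all that is needed to try another measure.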