.. _sphx_emq_convergence: =============================== EMQ and the EM prior correction =============================== :class:`~mlquantify.likelihood.EMQ` (also known as SLD or the Saerens–Latinne–Decaestecker method) adjusts a classifier's posteriors to a new test prevalence using **Expectation-Maximisation**. Starting from the training prior, it alternates between (E) re-scaling the posteriors by the current prevalence estimate and (M) averaging them into a new prevalence — repeating until the estimate stops moving. The example below runs that exact loop by hand on a shifted test sample, recording the prevalence after every iteration so we can watch it converge from the (wrong) training prior to the true test prevalence. .. plot:: import numpy as np import matplotlib.pyplot as plt from sklearn.datasets import make_classification from sklearn.linear_model import LogisticRegression from sklearn.model_selection import train_test_split rng = np.random.default_rng(0) X, y = make_classification( n_samples=6000, n_features=20, weights=[0.5, 0.5], random_state=0, ) X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0) clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr) train_prior = np.array([(y_tr == 0).mean(), (y_tr == 1).mean()]) # A strongly positive test sample (true positive prevalence = 0.80). pos = np.where(y_te == 1)[0] neg = np.where(y_te == 0)[0] true_prev = 0.80 n = 800 idx = np.concatenate([ rng.choice(pos, int(true_prev * n), replace=True), rng.choice(neg, n - int(true_prev * n), replace=True), ]) Px = clf.predict_proba(X_te[idx]) # The EMQ fixed-point iteration (same update as mlquantify's EMQ.EM). qs = train_prior.copy() history = [qs[1]] for _ in range(25): ratio = qs / train_prior ps = Px * ratio ps /= ps.sum(axis=1, keepdims=True) qs = ps.mean(axis=0) history.append(qs[1]) fig, ax = plt.subplots(figsize=(7, 4.5)) ax.plot(history, "o-", color="#2a9d8f", label="EMQ estimate") ax.axhline(true_prev, color="k", ls="--", lw=1, label=f"true = {true_prev:.2f}") ax.axhline(train_prior[1], color="#e76f51", ls=":", lw=1, label=f"training prior = {train_prior[1]:.2f}") ax.set_xlabel("EM iteration") ax.set_ylabel("Estimated positive prevalence") ax.set_title("EMQ converges from the training prior to the true prevalence") ax.set_ylim(0, 1) ax.legend(loc="center right") fig.tight_layout() The estimate starts at the training prior (0.5, the wrong answer for this sample), then climbs and flattens out near the true 0.80 within a handful of iterations. In practice you never write this loop yourself — ``EMQ(...).predict`` does it for you — but seeing it unrolled makes the method's behaviour concrete. .. seealso:: - :class:`~mlquantify.likelihood.EMQ` — the production implementation, with posterior calibration options. - :ref:`sphx_method_comparison` — EMQ on a diagonal plot.