.. _representations:

.. currentmodule:: mlquantify.representations

===============
Representations
===============

Representations turn a sample of instances (or classifier scores) into the
fixed-length descriptor that a distribution-matching quantifier compares. They
are what specialises a single matching skeleton into different methods: swapping
the representation is exactly what turns the same mixture-matching idea into HDy,
EDy, MMD or KDEy.

Role and mechanism
==================

A representation :math:`r(\cdot)` maps a set of instances to a vector, so that a
candidate prevalence :math:`p` produces a *mixed* representation
:math:`\sum_c p_c\, r_c` from the per-class descriptors :math:`r_c`. The
quantifier then searches for the :math:`p` whose mixture best matches the test
descriptor :math:`r_U` under a chosen loss.

Fitting a representation computes the per-class descriptors
(``class_representations_``); ``transform`` produces the descriptor of a new
sample. :class:`BaseRepresentation` defines the common ``fit`` / ``transform``
interface; custom representations subclass it.

The sections below describe each representation type and the descriptor it
produces; the :ref:`summary table <choosing-a-representation>` at the end
compares them and explains how to choose.

.. contents:: Contents
   :local:
   :depth: 1

----

Histogram representation
========================

:class:`HistogramRepresentation` bins the classifier scores (or features) and
stores one normalised histogram per class. It is the most widely used
representation, and its two most important parameters, ``bins`` and
``bin_edges``, are easiest to understand by seeing the descriptor they produce.

**Resolution** (``bins``). ``bins`` sets how many intervals the score range is
split into. Few bins give a coarse, stable summary; many bins capture fine
structure but need more data to fill each bin reliably.

.. plot::
   :caption: More ``bins`` means a finer (but noisier) summary of the same scores.

   import numpy as np
   import matplotlib.pyplot as plt
   from mlquantify.representations import HistogramRepresentation

   rng = np.random.default_rng(0)
   scores = rng.beta(2, 5, size=(2000, 1))   # classifier scores in [0, 1]
   y = (scores[:, 0] > 0.3).astype(int)

   fig, axes = plt.subplots(1, 3, figsize=(9, 2.8), sharey=True)
   for ax, n in zip(axes, (5, 10, 25)):
       rep = HistogramRepresentation(bins=(n,), range=(0.0, 1.0)).fit(scores, y)
       h = rep.transform(scores)             # normalised mass per bin
       edges = np.linspace(0, 1, n + 1)
       ax.bar((edges[:-1] + edges[1:]) / 2, h, width=(1.0 / n) * 0.9,
              color="#2a7ab9", edgecolor="white", linewidth=0.5)
       ax.set_title(f"bins = {n}", fontsize=10)
       ax.set_xlabel("classifier score")
       ax.set_xlim(0, 1)
   axes[0].set_ylabel("normalised mass")
   fig.tight_layout()

**Bin placement** (``bin_edges``). With ``'fixed'`` (the default) the bins are
equal-width over the ``range`` parameter (``[0, 1]`` by default), so bins outside
the region where the data actually falls stay empty and are wasted. With
``'auto'`` the edges are learned at ``fit`` time from the data range, so all bins
are packed where the scores actually are, giving more usable resolution on skewed
or concentrated score distributions. The grey band in the plot marks the true
data range.

.. plot::
   :caption: ``bin_edges='fixed'`` wastes bins outside the data; ``'auto'`` adapts.

   import numpy as np
   import matplotlib.pyplot as plt
   from mlquantify.representations import HistogramRepresentation

   rng = np.random.default_rng(3)
   scores = 0.2 + 0.3 * rng.beta(2, 3, size=(3000, 1))   # concentrated in [0.2, 0.5]
   y = (scores[:, 0] > scores[:, 0].mean()).astype(int)
   n = 12

   fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)

   # Fixed edges: equal-width bins over the full range.
   rep_fixed = HistogramRepresentation(bins=(n,), range=(0.0, 1.0),
                                       bin_edges="fixed").fit(scores, y)
   hf = rep_fixed.transform(scores)
   ef = np.linspace(0, 1, n + 1)
   axes[0].bar((ef[:-1] + ef[1:]) / 2, hf, width=np.diff(ef) * 0.9,
               color="#b9542a", edgecolor="white", linewidth=0.5)
   axes[0].set_title("bin_edges='fixed'\n(equal width over range=[0, 1])", fontsize=10)

   # Automatic edges: learned from the data range at fit time.
   rep_auto = HistogramRepresentation(bins=(n,), bin_edges="auto").fit(scores, y)
   ha = rep_auto.transform(scores)
   ea = rep_auto.edges_[0][0]                 # learned edges
   axes[1].bar((ea[:-1] + ea[1:]) / 2, ha, width=np.diff(ea) * 0.9,
               color="#2a9b5c", edgecolor="white", linewidth=0.5)
   axes[1].set_title("bin_edges='auto'\n(edges fit to the data range)", fontsize=10)

   for ax in axes:
       ax.set_xlim(0, 1)
       ax.set_xlabel("classifier score")
       ax.axvspan(scores.min(), scores.max(), color="0.85", zorder=0)
   axes[0].set_ylabel("normalised mass")
   fig.tight_layout()

**Class-conditional distributions: what gets stored.** A fitted
:class:`HistogramRepresentation` keeps one normalised histogram per class
(``class_representations_``). The descriptor is discriminative only when those
per-class distributions differ. When the classifier separates the classes, the
histograms barely overlap and quantification is easy; when it does not, they look
alike and quantification is hard, regardless of how the bins are configured.

.. plot::
   :caption: One histogram per class is stored; matching is easy only when they differ.

   import numpy as np
   import matplotlib.pyplot as plt
   from mlquantify.representations import HistogramRepresentation

   n = 12
   edges = np.linspace(0, 1, n + 1)
   centers = (edges[:-1] + edges[1:]) / 2
   rng = np.random.default_rng(7)

   fig, axes = plt.subplots(1, 2, figsize=(8.5, 3), sharey=True)
   scenarios = [("well-separated classes", (2, 6), (6, 2)),
                ("overlapping classes", (3, 3), (4, 3))]
   for ax, (title, (an, bn), (ap, bp)) in zip(axes, scenarios):
       neg = rng.beta(an, bn, size=900)
       pos = rng.beta(ap, bp, size=600)
       scores = np.concatenate([neg, pos]).reshape(-1, 1)
       y = np.concatenate([np.zeros(900), np.ones(600)]).astype(int)
       rep = HistogramRepresentation(bins=(n,), range=(0, 1)).fit(scores, y)
       ax.bar(centers, rep.class_representations_[0], width=1 / n * 0.9,
              alpha=0.6, color="#4477aa", label="negative class")
       ax.bar(centers, rep.class_representations_[1], width=1 / n * 0.9,
              alpha=0.6, color="#cc6677", label="positive class")
       ax.set_title(title, fontsize=10)
       ax.set_xlabel("classifier score")
   axes[0].set_ylabel("normalised mass")
   axes[0].legend(fontsize=8)
   fig.tight_layout()

The remaining parameters are minor: for a single score, ``mode='onehot'`` yields
essentially the same per-bin vector as the default ``'histogram'``;
``laplace_smoothing=True`` adds a small floor that removes empty bins
(stabilising ratio- and log-based distances such as Hellinger); and ``features``
selects which columns are histogrammed.

----

Density (KDE) representation
============================

:class:`KDERepresentation` replaces the bins with a smooth per-class kernel
density over the posteriors. ``bandwidth`` is its key control: too small and each
class density spikes on its training points (over-fitting); too large and the
class densities blur together, erasing the separation the quantifier relies on.

.. plot::
   :caption: ``bandwidth`` trades off over-fitting (left) against over-smoothing (right).

   import numpy as np
   import matplotlib.pyplot as plt
   from mlquantify.representations import KDERepresentation

   rng = np.random.default_rng(11)
   X = np.concatenate([rng.normal(0.3, 0.12, 400),
                       rng.normal(0.65, 0.12, 300)]).reshape(-1, 1)
   y = np.concatenate([np.zeros(400), np.ones(300)]).astype(int)
   grid = np.linspace(-0.1, 1.1, 200).reshape(-1, 1)

   fig, axes = plt.subplots(1, 3, figsize=(9.5, 2.8), sharey=True)
   for ax, bw, tag in zip(axes, (0.02, 0.1, 0.4), ("too small", "good", "too large")):
       rep = KDERepresentation(bandwidth=bw).fit(X, y)
       ax.plot(grid[:, 0], np.exp(rep.class_representations_[0].score_samples(grid)),
               color="#4477aa", label="negative")
       ax.plot(grid[:, 0], np.exp(rep.class_representations_[1].score_samples(grid)),
               color="#cc6677", label="positive")
       ax.set_title(f"bandwidth = {bw} ({tag})", fontsize=9)
       ax.set_xlabel("feature")
   axes[0].set_ylabel("density")
   axes[0].legend(fontsize=8)
   fig.tight_layout()

Unlike histograms, the KDE is smooth and multivariate, so it scales to several
classes on the posterior simplex where bin-based representations fragment.

----

Distance representation
=======================

:class:`DistanceRepresentation` summarises a sample by its **mean distance to
each class**, producing one ``(n_classes,)`` descriptor. It carries no per-bin
shape, but it discriminates because a sample sits closer (on average) to its own
class; these distances are the terms of the energy-distance objective.

.. plot::
   :caption: The descriptor is a sample's mean distance to each class, smallest to its own.

   import numpy as np
   import matplotlib.pyplot as plt
   from mlquantify.representations import DistanceRepresentation

   rng = np.random.default_rng(0)
   X0 = rng.normal(-1.3, 0.8, (300, 6))
   X1 = rng.normal(1.3, 0.8, (300, 6))
   X = np.vstack([X0, X1])
   y = np.r_[np.zeros(300), np.ones(300)].astype(int)

   rep = DistanceRepresentation().fit(X, y)
   d0 = np.asarray(rep.transform(X[y == 0]), float)   # sample drawn from class 0
   d1 = np.asarray(rep.transform(X[y == 1]), float)   # sample drawn from class 1

   x = np.arange(2)
   fig, ax = plt.subplots(figsize=(5.6, 3))
   ax.bar(x - 0.19, d0, width=0.38, color="#4477aa", label="sample from class 0")
   ax.bar(x + 0.19, d1, width=0.38, color="#cc6677", label="sample from class 1")
   ax.set_xticks(x)
   ax.set_xticklabels(["mean dist to class 0", "mean dist to class 1"])
   ax.set_ylabel("mean distance")
   ax.legend(fontsize=8)
   fig.tight_layout()

The ``metric`` parameter sets the ground distance (Euclidean, Manhattan, ...).

----

Kernel mean representation
==========================

:class:`KernelMeanRepresentation` embeds a whole sample as a single point: its
**mean embedding** in a reproducing-kernel Hilbert space (the mean feature vector
under a linear kernel). Each class is summarised by its mean embedding, and MMD
matches the test embedding to a mixture of the class embeddings.

.. plot::
   :caption: Each class is one mean-embedding vector; MMD matches the test mixture to these.

   import numpy as np
   import matplotlib.pyplot as plt
   from mlquantify.representations import KernelMeanRepresentation

   rng = np.random.default_rng(0)
   X0 = rng.normal(-1.3, 0.8, (300, 6))
   X1 = rng.normal(1.3, 0.8, (300, 6))
   X = np.vstack([X0, X1])
   y = np.r_[np.zeros(300), np.ones(300)].astype(int)

   rep = KernelMeanRepresentation(kernel="linear").fit(X, y)
   e0 = np.asarray(rep.transform(X[y == 0]), float)   # class-0 mean embedding
   e1 = np.asarray(rep.transform(X[y == 1]), float)   # class-1 mean embedding

   feat = np.arange(len(e0))
   fig, ax = plt.subplots(figsize=(6.2, 3))
   ax.bar(feat - 0.19, e0, width=0.38, color="#4477aa", label="negative class")
   ax.bar(feat + 0.19, e1, width=0.38, color="#cc6677", label="positive class")
   ax.set_xticks(feat)
   ax.set_xticklabels([f"f{i}" for i in feat])
   ax.set_xlabel("feature")
   ax.set_ylabel("mean embedding")
   ax.legend(fontsize=8)
   fig.tight_layout()

The ``kernel`` / ``gamma`` parameters choose the RKHS (an ``'rbf'`` kernel makes
the matching universal, capturing all moments rather than just the mean).

----

Prediction representation
=========================

:class:`PredictionRepresentation` summarises classifier outputs as either the
mean posterior (``method='soft'``) or the class frequency of the argmax labels
(``method='hard'``, i.e. Classify-and-Count). The hard descriptor discards
confidence and is more peaked toward the majority predicted class. These
descriptors feed the constrained-regression counting methods.

.. plot::
   :caption: ``'soft'`` keeps the posterior mass; ``'hard'`` collapses it to argmax counts.

   import numpy as np
   import matplotlib.pyplot as plt
   from mlquantify.representations import PredictionRepresentation

   rng = np.random.default_rng(5)
   proba = rng.dirichlet([3, 2, 2], size=500)   # 3-class posteriors
   y = proba.argmax(1)

   soft = np.asarray(PredictionRepresentation(method="soft").fit(proba, y).transform(proba))
   hard = np.asarray(PredictionRepresentation(method="hard").fit(proba, y).transform(proba))

   x = np.arange(3)
   fig, ax = plt.subplots(figsize=(5.4, 3))
   ax.bar(x - 0.19, soft, width=0.38, color="#2a7ab9",
          label="method='soft' (mean posterior)")
   ax.bar(x + 0.19, hard, width=0.38, color="#b9542a",
          label="method='hard' (class frequency = CC)")
   ax.set_xticks(x)
   ax.set_xticklabels([f"class {i}" for i in x])
   ax.set_ylabel("descriptor value")
   ax.legend(fontsize=8)
   fig.tight_layout()

:class:`HardPredictionRepresentation` and :class:`SoftPredictionRepresentation`
are the fixed-mode variants.

----

.. _choosing-a-representation:

Summary and how to choose
=========================

.. list-table::
   :widths: 26 36 22 16
   :header-rows: 1

   * - Representation
     - What it computes
     - Used by
     - Tuned by
   * - :class:`HistogramRepresentation`
     - Per-class binned probability-mass vectors of the scores.
     - DyS, HDy, HDx, GHDy, GHDx
     - ``bins``, ``bin_edges``
   * - :class:`KDERepresentation`
     - Per-class smooth multivariate densities over posteriors.
     - KDEyML, KDEyHD, KDEyCS, GKDEyML
     - ``bandwidth``, ``kernel``
   * - :class:`DistanceRepresentation`
     - Mean distance to each class (energy-distance terms).
     - EDy, EDx
     - ``metric``
   * - :class:`KernelMeanRepresentation`
     - Mean embedding of a sample in an RKHS.
     - MMD_RKHS
     - ``kernel``, ``gamma``
   * - :class:`PredictionRepresentation`
     - Mean posterior (soft) or class-frequency (hard) vector.
     - GACC, GPACC, FM
     - ``method``

**Choosing one:**

- **Histogram**: cheap and interpretable, but bin-sensitive and degrades on the
  high-dimensional posterior simplex; prefer it for binary score matching.
- **KDE**: replaces bins with smooth densities and scales to several classes;
  use it for multiclass density matching.
- **Distance** and **kernel mean**: bin-free descriptors that summarise a whole
  sample as a compact vector, used by the energy-distance and MMD methods.
- **Prediction**: the soft/hard descriptors that feed the constrained-regression
  counting methods.

Example
=======

.. code-block:: python

   from mlquantify.representations import HistogramRepresentation

   # train_scores, y_train and test_scores are placeholders for your own
   # classifier scores and training labels.
   rep = HistogramRepresentation(bins=(10,))
   rep.fit(train_scores, y_train)                     # learn the per-class histograms
   test_representation = rep.transform(test_scores)   # descriptor of the test sample

References
==========

.. dropdown:: References

   - González-Castro, V., Alaiz-Rodríguez, R., & Alegre, E. (2013). Class
     Distribution Estimation Based on the Hellinger Distance. *Information
     Sciences*, 218, 146–164. (histogram)
   - Moreo, A., González, P., & del Coz, J. J. (2024). Kernel Density Estimation
     for Multiclass Quantification. *Machine Learning*, 113, 3075–3107. (KDE)
   - Iyer, A., Nath, S., & Sarawagi, S. (2014). Maximum Mean Discrepancy for
     Class Ratio Estimation. *ICML*, 32. (kernel mean)

.. seealso::

   :ref:`losses` and :ref:`solvers` for the other two elements of the matching
   triple.
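
Worked sketch: mixture matching
===============================

The mechanism described under *Role and mechanism* can be illustrated without
the library. The snippet below is a minimal NumPy sketch, not mlquantify's
implementation: the helper names ``histogram`` and ``hellinger``, the synthetic
Beta-distributed scores, and the plain grid search are all assumptions made for
illustration. It builds one normalised histogram per class, mixes them with a
candidate prevalence :math:`p`, and keeps the :math:`p` whose mixture is closest
to the test histogram.

.. code-block:: python

   import numpy as np

   rng = np.random.default_rng(0)
   edges = np.linspace(0.0, 1.0, 11)   # 10 equal-width bins on [0, 1]

   def histogram(scores):
       """Normalised histogram of the scores over the shared bin edges."""
       counts, _ = np.histogram(scores, bins=edges)
       return counts / counts.sum()

   def hellinger(a, b):
       """Hellinger distance between two discrete distributions."""
       return np.sqrt(0.5 * np.sum((np.sqrt(a) - np.sqrt(b)) ** 2))

   # Per-class descriptors r_c, estimated from (synthetic) training scores.
   r_neg = histogram(rng.beta(2, 6, size=5000))
   r_pos = histogram(rng.beta(6, 2, size=5000))

   # Test sample whose true positive prevalence is 0.3 (600 of 2000).
   test_scores = np.concatenate([rng.beta(2, 6, size=1400),
                                 rng.beta(6, 2, size=600)])
   r_test = histogram(test_scores)

   # Search the candidate prevalence p whose mixture best matches r_test.
   grid = np.linspace(0.0, 1.0, 101)
   losses = [hellinger((1 - p) * r_neg + p * r_pos, r_test) for p in grid]
   p_hat = grid[int(np.argmin(losses))]
   print(f"estimated positive prevalence: {p_hat:.2f}")

Swapping ``histogram`` for another descriptor and ``hellinger`` for another loss
changes the method while leaving the search untouched, which is the sense in
which the representation specialises a shared matching skeleton.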