.. _evaluation_metrics:

.. currentmodule:: mlquantify.metrics

==================
Evaluation Metrics
==================

Quantification metrics measure the discrepancy between the estimated prevalence
vector :math:`\hat{p}` and the true prevalence vector :math:`p`. Unlike
classification metrics, they operate on **aggregate probability vectors**, not
on individual predictions.

All metrics in ``mlquantify`` follow the same calling convention:

.. code-block:: python

   error = MetricName(true_prevalences, predicted_prevalences)

Both arguments can be:

- A flat array of true class labels (``y_true``), or
- A prevalence dict/array returned by ``quantifier.predict(X)``.

``mlquantify`` automatically converts true labels to prevalences using
:func:`~mlquantify.utils.get_prev_from_labels` when needed.

.. contents:: Contents
   :local:
   :depth: 2

----

Absolute Error Metrics
=======================

AE and MAE — (Mean) Absolute Error
------------------------------------

.. math::

   \text{AE}(p, \hat{p}) = \frac{1}{|\mathcal{Y}|} \sum_{c \in \mathcal{Y}} |p(c) - \hat{p}(c)|

:func:`AE` computes the mean absolute difference per class over a *single*
sample. :func:`MAE` averages AE over *multiple* samples (from a protocol).

**When to use:** MAE is the standard metric for quantification evaluation. It is
interpretable (the average error in prevalence units, e.g. 0.05 means "off by
5 percentage points on average"), symmetric, and gives equal weight to all
classes and prevalence levels.

.. figure:: ../images/metrics_comparison.png
   :align: center
   :width: 90%
   :alt: MAE vs RAE weighting comparison

   *Left: for a fixed 5 percentage-point absolute error, MAE assigns equal weight
   regardless of the true prevalence (flat blue line), while RAE's contribution
   grows steeply as prevalence approaches zero (orange curve). Right: the same
   5 pp absolute error applied at different prevalence levels. RAE imposes a much
   heavier penalty at 5% and 10% prevalence, reflecting that a 5 pp error is far
   more significant when the true value is 5% than when it is 50%.*

.. code-block:: python

   from mlquantify.metrics import AE
   from mlquantify.utils import get_prev_from_labels
   import numpy as np

   y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 0, 1])
   y_pred = {0: 0.62, 1: 0.38}

   true_prev = get_prev_from_labels(y_true)  # 6 zeros, 4 ones
   print(AE(true_prev, y_pred))  # single-sample AE
   # ≈ 0.02

   # Over multiple protocol samples.
   # `q` is a fitted quantifier and `samples` an iterable of (X_s, y_s) pairs,
   # e.g. produced by an evaluation protocol; neither is defined in this snippet.
   errors = [AE(get_prev_from_labels(y_s), q.predict(X_s)) for X_s, y_s in samples]
   print(np.mean(errors))  # MAE: mean AE over all samples

SE and MSE — (Mean) Squared Error
-----------------------------------

.. math::

   \text{SE}(p, \hat{p}) = \frac{1}{|\mathcal{Y}|} \sum_{c \in \mathcal{Y}} (p(c) - \hat{p}(c))^2

:func:`SE` penalises large errors more severely than AE (quadratic vs linear).
Use MSE when large deviations are especially harmful.

.. code-block:: python

   from mlquantify.metrics import MSE

   # true_prev and y_pred as defined in the AE example above
   print(MSE(true_prev, y_pred))

----

Relative Error Metrics
=======================

RAE and NRAE — (Normalised) Relative Absolute Error
-----------------------------------------------------

.. math::

   \text{RAE}(p, \hat{p}) = \frac{1}{|\mathcal{Y}|} \sum_{c \in \mathcal{Y}}
   \frac{|p(c) - \hat{p}(c)|}{p(c) + \varepsilon}

where :math:`\varepsilon` is a small smoothing constant.

**When to use:** RAE amplifies errors at *low prevalences*. An error of
5 percentage points matters much more at 5% prevalence than at 50%. Use RAE when
rare classes are important (e.g. rare disease detection, fraud detection).

:func:`NRAE` normalises RAE to :math:`[0, 1]` so it is comparable across datasets
with different numbers of classes.

.. code-block:: python

   from mlquantify.metrics import RAE, NRAE

   # true_prev and y_pred as defined in the AE example above
   print(RAE(true_prev, y_pred))
   print(NRAE(true_prev, y_pred))

NAE — Normalised Absolute Error
--------------------------------

:func:`NAE` normalises AE by the number of classes so results are comparable
across different multiclass settings.

----

Divergence Metrics
==================

KLD and NKLD — (Normalised) Kullback-Leibler Divergence
---------------------------------------------------------

.. math::

   \text{KLD}(p, \hat{p}) = \sum_{c \in \mathcal{Y}} p(c) \log\frac{p(c)}{\hat{p}(c)}

KLD measures the *information loss* when using :math:`\hat{p}` to approximate
:math:`p`. The penalty grows without bound as :math:`\hat{p}(c)` approaches zero
for a class with :math:`p(c) > 0`, so a smoothed version is used internally.

**When to use:** KLD is used when prevalences represent probability distributions
and you care about calibration. It is asymmetric (the true and estimated
distributions are not interchangeable), so the argument order matters
(``KLD(true, pred)``).

:func:`NKLD` normalises to :math:`[0, 1]`.

.. code-block:: python

   from mlquantify.metrics import KLD, NKLD

   # true_prev and y_pred as defined in the AE example above
   print(KLD(true_prev, y_pred))
   print(NKLD(true_prev, y_pred))

----

Ordinal Metrics
================

NMD — Normalised Match Distance
---------------------------------

:func:`NMD` is designed for **ordinal** quantification tasks where classes have a
natural order (e.g. severity levels: mild < moderate < severe). It is based on
the earth mover's distance between :math:`p` and :math:`\hat{p}`, computed from
their cumulative distributions.

**When to use:** Use NMD when your classes are ordered and the distance between
adjacent classes matters. For non-ordinal problems, AE or KLD are more
appropriate.

.. code-block:: python

   from mlquantify.metrics import NMD

   # Ordinal classes: 0 < 1 < 2
   true_prev = [0.5, 0.3, 0.2]
   pred_prev = [0.4, 0.4, 0.2]
   print(NMD(true_prev, pred_prev))

RNOD — Relative Normalised Order Distance
------------------------------------------

:func:`RNOD` is a relative version of NMD that amplifies errors at low
prevalences, analogous to RAE for ordinal settings.

----

Distribution Distance Metrics
===============================

The following functions also serve as loss functions in distribution-matching
quantifiers:

.. list-table::
   :widths: 20 80
   :header-rows: 1

   * - Function
     - Description
   * - :func:`hellinger`
     - Hellinger distance between two distributions. :math:`\in [0, 1]`.
   * - :func:`topsoe`
     - TopSoe (Jensen-Shannon-like) divergence.
   * - :func:`probsymm`
     - Probabilistic symmetric chi-squared divergence.
   * - :func:`sqEuclidean`
     - Squared Euclidean distance.

These are available both in standard NumPy and JAX-compatible variants
(``hellinger_jax``, etc.) for gradient-based optimisation.

----

Choosing a Metric
=================

.. list-table::
   :widths: 15 85
   :header-rows: 1

   * - Metric
     - Use when
   * - **MAE**
     - Default. Most papers report MAE. Easy to interpret.
   * - **RAE**
     - Rare classes matter; you want errors at low prevalences amplified.
   * - **KLD / NKLD**
     - Probabilistic calibration of the prevalence vector matters.
   * - **MSE**
     - Large estimation errors are especially harmful.
   * - **NMD**
     - Classes have a natural order (ordinal quantification).
   * - **NRAE / NAE / NKLD**
     - Comparing results across datasets with different class counts.

**Full evaluation example:**

.. code-block:: python

   from mlquantify.model_selection import APP
   from mlquantify.metrics import MAE, RAE, NKLD
   from mlquantify.utils import get_prev_from_labels
   from mlquantify.likelihood import EMQ
   from sklearn.linear_model import LogisticRegression
   from sklearn.datasets import make_classification
   from sklearn.model_selection import train_test_split
   import numpy as np

   # Imbalanced binary problem, split in half for training and evaluation
   X, y = make_classification(n_samples=2000, weights=[0.8, 0.2], random_state=42)
   X_train, X_test, y_train, y_test = train_test_split(
       X, y, test_size=0.5, random_state=42)

   q = EMQ(LogisticRegression())
   q.fit(X_train, y_train)

   # Draw test samples at varying prevalences
   protocol = APP(batch_size=100, n_prevalences=21, repeats=10, random_state=42)

   maes, raes, nklds = [], [], []
   for idx in protocol.split(X_test, y_test):
       X_s, y_s = X_test[idx], y_test[idx]
       tp = get_prev_from_labels(y_s)  # true prevalence of this sample
       pp = q.predict(X_s)             # predicted prevalence
       maes.append(MAE(tp, pp))
       raes.append(RAE(tp, pp))
       nklds.append(NKLD(tp, pp))

   # Average each metric over all protocol samples
   print(f"MAE: {np.mean(maes):.4f}")
   print(f"RAE: {np.mean(raes):.4f}")
   print(f"NKLD: {np.mean(nklds):.4f}")

.. currentmodule:: mlquantify.metrics

Evaluation metrics for quantification assess the accuracy of estimated class
prevalences against true prevalences. These metrics are crucial for
understanding how well a quantifier performs, especially under distributional
shifts.

The library includes several widely used evaluation metrics:

.. list-table:: Metrics
   :header-rows: 1
   :widths: 30 70

   * - Metric
     - Description
   * - :class:`NMD`
     - Normalized Match Distance
   * - :class:`RNOD`
     - Relative Normalized Overall Deviation
   * - :class:`VSE`
     - Variance Shift Error
   * - :class:`CvM_L1`
     - Cramér-von Mises L1 Distance
   * - :class:`AE`
     - Absolute Error
   * - :class:`SE`
     - Squared Error
   * - :class:`MAE`
     - Mean Absolute Error
   * - :class:`MSE`
     - Mean Squared Error
   * - :class:`KLD`
     - Kullback-Leibler Divergence
   * - :class:`RAE`
     - Relative Absolute Error
   * - :class:`NAE`
     - Normalized Absolute Error
   * - :class:`NRAE`
     - Normalized Relative Absolute Error
   * - :class:`NKLD`
     - Normalized Kullback-Leibler Divergence

=========================================
Single Label Quantification (SLQ) Metrics
=========================================

AE (Absolute Error)
===================

**Parameters:**

- :math:`p`: array-like, shape (n_classes,)
  True prevalence (distribution of classes).
- :math:`\hat{p}`: array-like, shape (n_classes,)
  Estimated prevalence.

AE sums the absolute differences between true and estimated prevalence over all
classes:

.. math::

   \text{AE}(p, \hat{p}) = \sum_{c} |p(c) - \hat{p}(c)|

Its primary strength is transparency and ease of interpretation.

SE (Squared Error)
==================

**Parameters:**

- :math:`p`: array-like, shape (n_classes,)
  True prevalence.
- :math:`\hat{p}`: array-like, shape (n_classes,)
  Estimated prevalence.

SE is the sum of squared differences:

.. math::

   \text{SE}(p, \hat{p}) = \sum_{c} (p(c) - \hat{p}(c))^2

This penalizes larger errors more heavily, making outlier mistakes more obvious.

MAE (Mean Absolute Error)
=========================

**Parameters:**

- :math:`p`: array-like, shape (n_classes,)
  True prevalence.
- :math:`\hat{p}`: array-like, shape (n_classes,)
  Estimated prevalence.

MAE averages the absolute errors over all :math:`K` classes:

.. math::

   \text{MAE}(p, \hat{p}) = \frac{1}{K} \sum_{c} |p(c) - \hat{p}(c)|

Dividing by the number of classes gives a normalized value that is useful for
comparing performance across datasets.

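As a quick sanity check, the AE, SE and MAE formulas of this section can be reproduced with plain NumPy. The sketch below follows the formulas exactly as written here and is not the ``mlquantify`` implementation; the prevalence vectors and variable names are made up for illustration.

```python
import numpy as np

# Hypothetical three-class prevalence vectors (each sums to 1).
p = np.array([0.5, 0.3, 0.2])      # true prevalence
p_hat = np.array([0.4, 0.4, 0.2])  # estimated prevalence

abs_diff = np.abs(p - p_hat)       # per-class absolute differences

ae = abs_diff.sum()                # AE: sum of absolute differences
se = (abs_diff ** 2).sum()         # SE: sum of squared differences
mae = abs_diff.mean()              # MAE: AE divided by the K classes

print(f"AE={ae:.4f}  SE={se:.4f}  MAE={mae:.4f}")
# AE=0.2000  SE=0.0200  MAE=0.0667
```

Two classes are each off by 0.1, so AE is 0.2 and MAE is a third of that. SE is ten times smaller than AE here because squaring shrinks errors below 1.
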
MSE (Mean Squared Error)
========================

**Parameters:**

- :math:`p`: array-like, shape (n_classes,)
  True prevalence.
- :math:`\hat{p}`: array-like, shape (n_classes,)
  Estimated prevalence.

MSE averages the squared errors over all :math:`K` classes:

.. math::

   \text{MSE}(p, \hat{p}) = \frac{1}{K} \sum_{c} (p(c) - \hat{p}(c))^2

Ideal for highlighting large deviations in prevalence estimation.

KLD (Kullback-Leibler Divergence)
=================================

**Parameters:**

- :math:`p`: array-like, shape (n_classes,)
  True prevalence.
- :math:`\hat{p}`: array-like, shape (n_classes,)
  Estimated prevalence.

KLD measures the information loss between distributions:

.. math::

   \text{KLD}(p, \hat{p}) = \sum_{c} p(c) \log \frac{p(c)}{\hat{p}(c)}

Its key advantage is sensitivity to wrong predictions where the true prevalence
is high.

RAE (Relative Absolute Error)
=============================

**Parameters:**

- :math:`p`: array-like, shape (n_classes,)
  True prevalence.
- :math:`\hat{p}`: array-like, shape (n_classes,)
  Estimated prevalence.
- :math:`\epsilon`: float, optional (default=1e-12)
  Small constant to ensure numerical stability.

RAE scales the absolute error by true prevalence:

.. math::

   \text{RAE}(p, \hat{p}) = \sum_{c} \frac{|p(c) - \hat{p}(c)|}{p(c) + \epsilon}

A fixed absolute error therefore counts more for rare classes, which is useful
in imbalanced scenarios.

NAE (Normalized Absolute Error)
===============================

**Parameters:**

- :math:`p`: array-like, shape (n_classes,)
  True prevalence.
- :math:`\hat{p}`: array-like, shape (n_classes,)
  Estimated prevalence.

NAE normalizes the absolute error:

.. math::

   \text{NAE}(p, \hat{p}) = \frac{1}{K} \sum_{c} \frac{|p(c) - \hat{p}(c)|}{\max\{p(c), \hat{p}(c)\}}

Each class error is scaled by the larger of the true and estimated prevalence,
so every term lies in :math:`[0, 1]` and the result does not depend on the error
scale.

NRAE (Normalized Relative Absolute Error)
=========================================

**Parameters:**

- :math:`p`: array-like, shape (n_classes,)
  True prevalence.
- :math:`\hat{p}`: array-like, shape (n_classes,)
  Estimated prevalence.
- :math:`\epsilon`: float, optional (default=1e-12)
  Small constant for numerical stability.

NRAE further normalizes relative errors:

.. math::

   \text{NRAE}(p, \hat{p}) = \frac{1}{K} \sum_{c} \frac{|p(c) - \hat{p}(c)|}{p(c) + \hat{p}(c) + \epsilon}

The denominator is symmetric in :math:`p` and :math:`\hat{p}`, which balances
the error measurement between true and estimated values.

NKLD (Normalized Kullback-Leibler Divergence)
=============================================

**Parameters:**

- :math:`p`: array-like, shape (n_classes,)
  True prevalence.
- :math:`\hat{p}`: array-like, shape (n_classes,)
  Estimated prevalence.
- :math:`\epsilon`: float, optional (default=1e-12)
  Small constant for numerical stability.

NKLD outputs a normalized form of KLD:

.. math::

   \text{NKLD}(p, \hat{p}) = \frac{1}{K} \sum_{c} p(c) \log \frac{p(c)}{\hat{p}(c) + \epsilon}

This makes it robust for comparing across distinct sample sizes.

============================================
Regression-Based Quantification (RQ) Metrics
============================================

VSE (Variance Shift Error)
==========================

**Parameters:**

- :math:`p`: array-like, shape (n_classes,)
  True prevalence.
- :math:`\hat{p}`: array-like, shape (n_classes,)
  Estimated prevalence.

The Variance Shift Error quantifies the discrepancy between the variances of the
true and estimated distributions:

.. math::

   \text{VSE}(p, \hat{p}) = |\text{Var}(p) - \text{Var}(\hat{p})|

This metric emphasizes changes in dispersion, which is useful for detecting
model bias towards certain classes.

CvM_L1 (Cramér-von Mises L1 Distance)
=====================================

**Parameters:**

- :math:`p`: array-like, shape (n_classes,)
  True prevalence.
- :math:`\hat{p}`: array-like, shape (n_classes,)
  Estimated prevalence.

CvM_L1 compares cumulative distributions using the L1 norm:

.. math::

   \text{CvM\_L1}(p, \hat{p}) = \sum_{c} |F_p(c) - F_{\hat{p}}(c)|

where :math:`F_p(c)` is the cumulative distribution of :math:`p` up to class
:math:`c`. Its advantage lies in capturing distributional differences beyond
pointwise errors.

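To make the "beyond pointwise errors" remark concrete, the CvM_L1 formula can be sketched with ``numpy.cumsum``. This is an illustration of the formula as written above, assuming the classes are indexed in a meaningful order; ``cvm_l1`` is a hypothetical helper, not the library function, and the prevalence vectors are invented.

```python
import numpy as np

def cvm_l1(p, p_hat):
    """Sum of absolute differences between the two cumulative distributions."""
    F_p = np.cumsum(np.asarray(p, dtype=float))
    F_p_hat = np.cumsum(np.asarray(p_hat, dtype=float))
    return float(np.abs(F_p - F_p_hat).sum())

# Hypothetical prevalences over three ordered classes.
p = [0.5, 0.3, 0.2]
near = [0.4, 0.4, 0.2]  # 0.1 of mass moved from class 0 to the adjacent class 1
far = [0.4, 0.3, 0.3]   # 0.1 of mass moved from class 0 to the distant class 2

# Both estimates have the same pointwise (L1) error ...
l1_near = float(np.abs(np.subtract(p, near)).sum())
l1_far = float(np.abs(np.subtract(p, far)).sum())

# ... but the cumulative comparison penalises the longer move more.
cvm_near = cvm_l1(p, near)
cvm_far = cvm_l1(p, far)

print(f"L1:     near={l1_near:.2f}  far={l1_far:.2f}")
print(f"CvM_L1: near={cvm_near:.2f}  far={cvm_far:.2f}")
```

Both estimates have a pointwise error of 0.20, yet CvM_L1 is 0.10 for ``near`` and 0.20 for ``far``, because the misplaced mass travels across two class boundaries instead of one.
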
===================================
Ordinal Quantification (OQ) Metrics
===================================

NMD (Normalized Match Distance)
===============================

**Parameters:**

- :math:`p`: array-like, shape (n_classes,)
  True prevalence.
- :math:`\hat{p}`: array-like, shape (n_classes,)
  Estimated prevalence.

The NMD metric quantifies the normalized difference between two prevalence
distributions:

.. math::

   \text{NMD}(p, \hat{p}) = \frac{1}{2} \sum_{c} |p(c) - \hat{p}(c)|

where :math:`p(c)` is the true prevalence and :math:`\hat{p}(c)` is the
estimated prevalence of class :math:`c`. The advantage of NMD is its
straightforward interpretability and normalization, which make it well suited
to comparing different quantification methods.

RNOD (Relative Normalized Overall Deviation)
============================================

**Parameters:**

- :math:`p`: array-like, shape (n_classes,)
  True prevalence.
- :math:`\hat{p}`: array-like, shape (n_classes,)
  Estimated prevalence.
- :math:`\epsilon`: float, optional (default=1e-12)
  Small constant to ensure numerical stability.

RNOD measures the proportional deviation between the true and estimated
prevalence, particularly highlighting errors in rare classes:

.. math::

   \text{RNOD}(p, \hat{p}) = \frac{1}{K} \sum_{c} \frac{|p(c) - \hat{p}(c)|}{p(c) + \epsilon}

Its benefit is in handling imbalanced distributions by reducing the influence of
dominant classes.
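
The effect of dividing by the true prevalence can be seen in a small NumPy sketch of the RNOD formula as written above. It is only an illustration under that formula, not the ``mlquantify`` implementation; ``relative_deviation`` and the example vectors are hypothetical.

```python
import numpy as np

def relative_deviation(p, p_hat, eps=1e-12):
    """Per-class terms |p(c) - p_hat(c)| / (p(c) + eps) of the formula above."""
    p = np.asarray(p, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    return np.abs(p - p_hat) / (p + eps)

# Hypothetical imbalanced case: the same 0.02 absolute error
# on the dominant class (0.90) and on the rare class (0.02).
p = [0.90, 0.08, 0.02]
p_hat = [0.88, 0.08, 0.04]

terms = relative_deviation(p, p_hat)
score = float(terms.mean())  # average over the K classes

print(np.round(terms, 4))  # per-class contributions
print(f"{score:.4f}")
```

The rare class contributes a term of about 1.0 while the dominant class contributes about 0.022, so the averaged score (about 0.34) is driven almost entirely by the rare class.
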