5. Model Selection and Evaluation

Evaluating and selecting quantification models requires dedicated protocols and metrics. Standard classification tools (train/test split, accuracy, F1) are not sufficient, because a quantifier's error depends on the class prevalence of the test sample rather than on the correctness of individual label predictions, and a single test split exposes only one prevalence.
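The dependence on test prevalence can be illustrated with a small calculation that does not use the library. The sketch below assumes a hypothetical classifier with a fixed true-positive rate and false-positive rate (the values 0.8 and 0.1 are made up for illustration) and estimates prevalence by simply counting predicted positives, a baseline usually called classify and count. The classifier itself never changes, yet the error of the prevalence estimate varies with the true prevalence.

```python
import numpy as np


def classify_and_count_estimate(true_prev, tpr, fpr):
    """Expected share of predicted positives for a classifier with fixed TPR/FPR."""
    return true_prev * tpr + (1.0 - true_prev) * fpr


# Hypothetical classifier characteristics, chosen only for illustration.
TPR, FPR = 0.8, 0.1

# True positive-class prevalences of five imagined test samples.
prevalences = np.linspace(0.0, 1.0, 5)

estimates = classify_and_count_estimate(prevalences, TPR, FPR)
abs_errors = np.abs(estimates - prevalences)

for p, est, err in zip(prevalences, estimates, abs_errors):
    print(f"true={p:.2f}  estimated={est:.3f}  abs_error={err:.3f}")
```

Under these assumed rates the absolute error ranges from 0.025 at a true prevalence of 0.25 to 0.2 at a true prevalence of 1.0, which is why a single test split at one prevalence says little about how a quantifier behaves elsewhere.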

This section covers the full evaluation workflow:

  1. Protocols — how to generate many test samples with varying prevalences from a single dataset (APP, UPP, NPP).

  2. Hyperparameter tuning — how to use GridSearchQ to select the best quantifier configuration.

  3. Evaluation metrics — which error measure to use and when.

Quick example — full evaluation pipeline:

from mlquantify.likelihood import EMQ
from mlquantify.model_selection import APP
from mlquantify.metrics import MAE
from mlquantify.utils import get_prev_from_labels
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np

# Synthetic binary dataset with a roughly 70/30 class balance
X, y = make_classification(n_samples=2000, weights=[0.7, 0.3],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

# Fit the quantifier on the training half
q = EMQ(LogisticRegression())
q.fit(X_train, y_train)

# Protocol that draws test batches with varying prevalences
protocol = APP(batch_size=100, n_prevalences=21, repeats=10,
               random_state=42)

# Compare true and predicted prevalences on every sampled batch
errors = []
for idx in protocol.split(X_test, y_test):
    X_s, y_s = X_test[idx], y_test[idx]
    tp = get_prev_from_labels(y_s)
    pp = q.predict(X_s)
    errors.append(MAE(tp, pp))

print(f"Mean MAE: {np.mean(errors):.4f} ± {np.std(errors):.4f}")
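To make the idea behind the protocol loop more concrete, here is a minimal NumPy-only sketch of drawing test batches at controlled prevalences. It is an illustration in the spirit of an artificial-prevalence protocol, not mlquantify's implementation: the helper `sample_at_prevalence`, the toy labels, and the five-point prevalence grid are all assumptions introduced for this example.

```python
import numpy as np


def sample_at_prevalence(y, prevalence, batch_size, rng):
    """Return shuffled indices of a batch whose positive share matches `prevalence`.

    Hypothetical helper for binary labels (0/1); not part of mlquantify.
    """
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_pos = int(round(prevalence * batch_size))
    n_neg = batch_size - n_pos
    # Sample with replacement so extreme prevalences still work
    # when one class has fewer examples than requested.
    chosen_pos = rng.choice(pos_idx, size=n_pos, replace=True)
    chosen_neg = rng.choice(neg_idx, size=n_neg, replace=True)
    return rng.permutation(np.concatenate([chosen_pos, chosen_neg]))


rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)   # toy binary labels, roughly balanced

grid = np.linspace(0.0, 1.0, 5)     # target prevalences to sweep over
realized = []
for p in grid:
    idx = sample_at_prevalence(y, p, batch_size=100, rng=rng)
    realized.append(float(y[idx].mean()))

print("target prevalences:  ", grid)
print("realized prevalences:", realized)
```

Each batch of indices could then be used the same way as `idx` in the pipeline above: slice the test features and labels, predict the prevalence, and record the error. Sweeping the grid is what lets the averaged error reflect behavior across the whole prevalence range rather than at the natural prevalence only.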