5.1. Protocols for Quantification
Evaluating a quantifier on a single test set is misleading — the test prevalence is fixed, so you only see performance at one operating point. Quantification protocols address this by generating many test batches with varying prevalences from the same data, giving a fuller picture of method behaviour across the entire prevalence spectrum.
Why protocols matter
A quantifier that looks excellent at 50/50 prevalence may fail badly at 5/95. Forman (2005) noted that the choice of evaluation protocol is as important as the choice of method. Standard practice in quantification research is to evaluate across a grid of prevalences (APP) and report the mean error over all samples.
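To make the 50/50 versus 5/95 contrast concrete, the sketch below computes the expected estimate of a plain classify-and-count approach whose classifier has fixed true and false positive rates. The rates 0.9 and 0.2 and the helper name expected_cc_estimate are illustrative assumptions, not measurements of any method in this library.

```python
def expected_cc_estimate(true_prev, tpr, fpr):
    """Expected fraction of instances predicted positive by a classifier
    with the given true/false positive rates at prevalence `true_prev`."""
    return true_prev * tpr + (1.0 - true_prev) * fpr


# Illustrative classifier quality (assumed values).
TPR, FPR = 0.9, 0.2

for p in (0.50, 0.05):
    estimate = expected_cc_estimate(p, TPR, FPR)
    print(f"true prevalence {p:.2f} -> expected estimate {estimate:.3f}, "
          f"absolute error {abs(estimate - p):.3f}")
```

Under these assumed rates the absolute error at 5% prevalence is several times the error at 50%, which a single balanced test set would not reveal.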
5.1.1. Quick evaluation with apply_protocol
apply_protocol runs the whole evaluation loop in a single call — the
protocol analogue of scikit-learn’s
cross_validate. It fits the quantifier, samples
the test batches with the chosen protocol, predicts each one, and returns the
true and predicted prevalences together with one score array per metric:
from mlquantify.model_selection import apply_protocol
from mlquantify.likelihood import EMQ
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=42)
results = apply_protocol(
    EMQ(LogisticRegression()), X, y,
    protocol="app",           # 'app' | 'npp' | 'upp' | 'ppp'
    scoring=["mae", "nmd"],   # one metric name, a callable, or a list
    n_prevalences=21,
    batch_size=100,
    test_size=0.5,            # held-out pool the protocol samples from
    random_state=42,
)
print("samples:", results["n_batches"])
print("MAE:", results["MAE"].mean(), "NMD:", results["NMD"].mean())
# results["true_prevalences"], results["predicted_prevalences"] -> (n_samples, n_classes)
By default a copy of the quantifier is trained on 1 - test_size of the data
and evaluated on the rest. Pass fit=False to evaluate an already-fitted
quantifier, return_estimator=True to get the trained model back, or a
BaseProtocol instance as protocol for full control. The sections
below document the underlying protocols, which you can also drive manually.
5.1.2. APP: Artificial Prevalence Protocol
APP is the most widely used evaluation protocol. By default it draws
samples from the test set at each prevalence in a uniform grid
\(\{0, \frac{1}{n-1}, \frac{2}{n-1}, \ldots, 1\}\) for the positive
class, where \(n\) is n_prevalences, repeating each prevalence repeats times.
With n_prevalences=21, for example, the grid step is 0.05.
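The grid itself is easy to reproduce by hand. The following is a minimal sketch, not the library's implementation: prevalence_grid and its lo/hi arguments are hypothetical names, and the batch count assumes one batch per grid point and repeat.

```python
def prevalence_grid(n_prevalences, lo=0.0, hi=1.0):
    """Equally spaced positive-class prevalences, endpoints included."""
    if n_prevalences < 2:
        raise ValueError("need at least two grid points")
    step = (hi - lo) / (n_prevalences - 1)
    return [lo + i * step for i in range(n_prevalences)]


grid = prevalence_grid(21)        # 21 points with a step of 0.05
repeats = 10
n_batches = len(grid) * repeats   # one batch per (prevalence, repeat) pair

print([round(p, 2) for p in grid[:3]], "...", round(grid[-1], 2))
print("batches:", n_batches)
```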
Why it is standard: APP ensures every method is evaluated at many prevalence values, not just the natural one. It exposes systematic biases (e.g. methods that only work near 50/50) and gives a fair cross-method comparison. Review papers such as González et al. (2017) routinely use APP as the evaluation backbone.
Choosing how prevalences are produced. The strategy parameter selects
how the prevalence vectors are drawn over the simplex. 'grid' is the
classic systematic sweep; the other strategies sample the simplex and scale
to many classes without the grid’s combinatorial blow-up. UPP is
simply APP with a sampling strategy pinned on.
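As an illustration of what "sampling the simplex" means, the sketch below draws prevalence vectors uniformly by cutting the unit interval at random points and taking the segment lengths, a standard construction. The helper name sample_simplex_uniform is hypothetical, and whether any of the library's strategies is implemented exactly this way is an assumption.

```python
import random


def sample_simplex_uniform(n_classes, rng):
    """Draw one prevalence vector uniformly from the probability simplex.

    Cuts [0, 1] at n_classes - 1 sorted uniform points and returns the
    lengths of the resulting segments, which are non-negative and sum to 1.
    """
    cuts = sorted(rng.random() for _ in range(n_classes - 1))
    edges = [0.0] + cuts + [1.0]
    return [right - left for left, right in zip(edges[:-1], edges[1:])]


rng = random.Random(42)
vectors = [sample_simplex_uniform(4, rng) for _ in range(5)]
for v in vectors:
    print([round(p, 3) for p in v], "sum =", round(sum(v), 6))
```

The cost of drawing one vector grows linearly with the number of classes (up to the sort), which is why sampling scales where a full grid does not.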
5.1.2.1. Parameters
| Parameter | Default | Explanation |
|---|---|---|
| | required | Number of instances per test sample. Larger batches give more stable prevalence estimates but require a larger test set. A typical choice is 100–500. |
| | | Number of equally-spaced prevalence points from |
| | | How many independent samples to draw at each prevalence level. More repeats reduce variance in the average error estimate. Use ≥ 5 for reliable results. |
| | | Minimum positive class prevalence in the grid. Leave at 0 to include the all-negative case. |
| | | Maximum positive class prevalence. Leave at 1 to include the all-positive case. |
| | | How prevalence vectors are generated over the simplex: |
| | | Concentration for |
| | | Seed for reproducible sampling. |
Left: APP generates test samples at every point on a regular prevalence grid (blue dots), giving systematic coverage from 0% to 100% positive class. Right: NPP draws random sub-samples that cluster near the natural training prevalence (~50%), providing realistic but narrower coverage.
5.1.2.2. Examples
Standard evaluation loop:
from mlquantify.model_selection import APP
from mlquantify.metrics import MAE
from mlquantify.utils import get_prev_from_labels
from mlquantify.likelihood import EMQ
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
import numpy as np
X, y = make_classification(n_samples=2000, weights=[0.7, 0.3],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42)

# Train once, then score the fitted quantifier on every protocol batch.
q = EMQ(LogisticRegression())
q.fit(X_train, y_train)

protocol = APP(batch_size=100, n_prevalences=21, repeats=10,
               random_state=42)
errors = []
for idx in protocol.split(X_test, y_test):
    X_sample, y_sample = X_test[idx], y_test[idx]
    true_prev = get_prev_from_labels(y_sample)
    pred_prev = q.predict(X_sample)
    errors.append(MAE(true_prev, pred_prev))
print(f"Mean MAE over {len(errors)} samples: {np.mean(errors):.4f}")
Comparing multiple quantifiers:
import numpy as np
from mlquantify.counting import CC, PCC
from mlquantify.likelihood import EMQ
from mlquantify.matching import DyS
from mlquantify.metrics import MAE
from mlquantify.model_selection import APP
from mlquantify.utils import get_prev_from_labels
from sklearn.linear_model import LogisticRegression

# X_train, X_test, y_train and y_test come from the standard evaluation loop above.
quantifiers = {
    'CC': CC(LogisticRegression()),
    'PCC': PCC(LogisticRegression()),
    'EMQ': EMQ(LogisticRegression()),
    'DyS': DyS(LogisticRegression()),
}
for name, q in quantifiers.items():
    q.fit(X_train, y_train)

protocol = APP(batch_size=100, n_prevalences=21, repeats=10,
               random_state=42)
results = {name: [] for name in quantifiers}
# Every quantifier is scored on the same batches, so the comparison is paired.
for idx in protocol.split(X_test, y_test):
    X_s, y_s = X_test[idx], y_test[idx]
    true_prev = get_prev_from_labels(y_s)
    for name, q in quantifiers.items():
        results[name].append(MAE(true_prev, q.predict(X_s)))

for name, errs in results.items():
    print(f"{name:5s} MAE={np.mean(errs):.4f}")
5.1.3. NPP: Natural Prevalence Protocol
NPP draws random sub-samples from the test set without altering
the natural class distribution. Each sample has a slightly different
prevalence due to random variation, but no artificial manipulation is
performed.
Why it exists: NPP evaluates quantifiers under real prevalence variation — how they perform when deployed on random sub-populations drawn from the same underlying distribution as the test set. It is less controlled than APP but more realistic.
Limitation: Because NPP cannot produce extreme prevalences (e.g. 2% positive) without a very large test set, it gives a narrower view of method behaviour than APP.
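The narrow coverage can be seen with a small simulation. This is a sketch on synthetic labels rather than the library's NPP: the pool size, the 30% positive rate, and the helper name natural_batch_prevalences are illustrative assumptions.

```python
import random


def natural_batch_prevalences(labels, batch_size, n_samples, seed):
    """Positive-class prevalence of random sub-samples drawn without
    replacement and without any prevalence manipulation."""
    rng = random.Random(seed)
    prevalences = []
    for _ in range(n_samples):
        batch = rng.sample(labels, batch_size)
        prevalences.append(sum(batch) / batch_size)
    return prevalences


# Synthetic pool: 1000 labels, 30% positive (illustrative values).
pool = [1] * 300 + [0] * 700
prevs = natural_batch_prevalences(pool, batch_size=100, n_samples=50, seed=42)
print(f"min={min(prevs):.2f}  max={max(prevs):.2f}")
```

With batches of 100 the sampled prevalences stay within a few percentage points of 30%, so operating points such as 2% or 90% positive are effectively never tested.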
5.1.3.1. Parameters
| Parameter | Default | Explanation |
|---|---|---|
| | required | Size of each random sub-sample. |
| | | Number of random sub-samples to draw. |
| | | Seed for reproducibility. |
from mlquantify.model_selection import NPP
from mlquantify.utils import get_prev_from_labels

# q, X_test and y_test are the fitted quantifier and test split from the APP example.
protocol = NPP(batch_size=100, n_samples=50, random_state=42)
for idx in protocol.split(X_test, y_test):
    X_s, y_s = X_test[idx], y_test[idx]
    true_prev = get_prev_from_labels(y_s)   # natural prevalence of this batch
    pred_prev = q.predict(X_s)
5.1.4. UPP: Uniform Prevalence Protocol
UPP samples prevalence vectors uniformly from the probability
simplex. It is exactly APP with the simplex sampling strategy
pinned on ('kraemer' by default). For binary problems it is similar to APP,
but for multiclass problems it avoids the combinatorial explosion of
sweeping all class-prevalence combinations independently.
Why it exists: For \(k\) classes, a grid approach like APP grows as \(O(n^{k-1})\) which quickly becomes intractable. UPP samples \(n\) random vectors from the simplex, covering the multiclass prevalence space efficiently without a rigid grid. Maletzke et al. (2020) recommend UPP for multiclass evaluation.
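The growth of a full grid can be checked directly. In the sketch below, a grid with \(n\) equally spaced values per class and \(k\) classes is counted with the stars-and-bars formula \(\binom{n+k-2}{k-1}\) and cross-checked by enumeration for small cases. Both function names are hypothetical, and the count describes a generic simplex grid, not necessarily the exact set of points the library generates.

```python
from itertools import product
from math import comb


def grid_size_closed_form(n_levels, n_classes):
    """Number of prevalence vectors on a simplex grid with `n_levels`
    equally spaced values per class (stars-and-bars count)."""
    return comb(n_levels + n_classes - 2, n_classes - 1)


def grid_size_brute_force(n_levels, n_classes):
    """Same count by enumeration; only feasible for small inputs."""
    total = n_levels - 1  # prevalences expressed in units of 1 / (n_levels - 1)
    return sum(
        1
        for combo in product(range(n_levels), repeat=n_classes - 1)
        if sum(combo) <= total  # the last class takes whatever mass remains
    )


for k in (2, 3, 4, 10):
    print(f"{k} classes, 21 levels -> {grid_size_closed_form(21, k)} grid points")
```

The count stays below the \(n^{k-1}\) bound but still reaches roughly ten million vectors for 10 classes at 21 levels, whereas a sampling strategy evaluates only as many vectors as requested.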
5.1.4.1. Parameters
| Parameter | Default | Explanation |
|---|---|---|
| | required | Size of each sample. |
| | | Number of prevalence vectors to sample from the simplex. |
| | | Simplex sampling strategy, forwarded to |
| | | Concentration used when |
| | (deprecated) | Deprecated alias for |
| | | Minimum per-class prevalence. Raise (e.g. to |
| | | Maximum per-class prevalence. |
| | | Seed. |
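from mlquantify.model_selection import UPP
from mlquantify.metrics import MAE
from mlquantify.likelihood import EMQ
from mlquantify.utils import get_prev_from_labels
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, n_classes=4,
                           n_informative=6, n_redundant=0,
                           random_state=42)
X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

# Fit on the four-class training split: a quantifier fitted on the earlier
# binary data would return two prevalences instead of four.
# EMQ is used on the assumption that it accepts multiclass labels.
q = EMQ(LogisticRegression())
q.fit(X_train, y_train)

protocol = UPP(batch_size=100, n_prevalences=200, strategy='uniform',
               random_state=42)
errors = []
for idx in protocol.split(X_test, y_test):
    X_s, y_s = X_test[idx], y_test[idx]
    true_prev = get_prev_from_labels(y_s)
    pred_prev = q.predict(X_s)
    errors.append(MAE(true_prev, pred_prev))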
5.1.5. PPP: Personalized Prevalence Protocol
PPP generates samples at class prevalences you specify explicitly,
for targeted evaluation at exact operating points (where APP and UPP sweep the
prevalences for you). Pass a list of prevalence vectors; in the binary case a
single float is read as the positive-class prevalence.
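Drawing a batch at an exact prevalence comes down to sampling a fixed number of instances per class. The following is a minimal sketch of that idea on synthetic labels. The helper draw_batch_indices is hypothetical, and sampling without replacement and absorbing rounding drift in the largest class are assumptions of this sketch, not documented library behaviour.

```python
import random


def draw_batch_indices(labels, prevalence, batch_size, rng):
    """Sample `batch_size` indices whose class proportions match `prevalence`.

    `prevalence[c]` is the target share of class `c`; classes are assumed to
    be the integers 0 .. len(prevalence) - 1.
    """
    counts = [round(p * batch_size) for p in prevalence]
    # Absorb any rounding drift in the largest class so counts sum exactly.
    largest = max(range(len(counts)), key=counts.__getitem__)
    counts[largest] += batch_size - sum(counts)

    indices = []
    for cls, n_needed in enumerate(counts):
        candidates = [i for i, y in enumerate(labels) if y == cls]
        if n_needed > len(candidates):
            raise ValueError(f"not enough instances of class {cls}")
        indices.extend(rng.sample(candidates, n_needed))
    rng.shuffle(indices)
    return indices


labels = [0] * 500 + [1] * 500   # synthetic binary pool
rng = random.Random(42)
idx = draw_batch_indices(labels, [0.1, 0.9], batch_size=100, rng=rng)
n_pos = sum(labels[i] for i in idx)
print(f"batch size {len(idx)}, positives {n_pos}")
```

The explicit error for an under-represented class reflects a practical limit: a target prevalence can only be met without replacement if the pool holds enough instances of every class.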
5.1.5.1. Parameters
| Parameter | Default | Explanation |
|---|---|---|
| | required | Size of each sample. |
| | required | List of target prevalence vectors (or floats for binary problems). |
| | | Number of samples drawn per target prevalence. |
| | | Seed for reproducibility. |
from mlquantify.model_selection import PPP
from mlquantify.utils import get_prev_from_labels

# q, X_test and y_test are the binary quantifier and test split from the APP
# example; the two-element prevalence vectors below assume binary labels.
protocol = PPP(batch_size=100,
               prevalences=[[0.1, 0.9], [0.5, 0.5], [0.9, 0.1]],
               random_state=42)
for idx in protocol.split(X_test, y_test):
    X_s, y_s = X_test[idx], y_test[idx]
    true_prev = get_prev_from_labels(y_s)
    pred_prev = q.predict(X_s)
5.1.6. Choosing a Protocol
| Protocol | Problem type | Use when |
|---|---|---|
| APP | Binary | Default for binary problems. Systematic sweep; standard in quantification research. Forman (2005) introduced the concept. |
| NPP | Binary / multiclass | You want realistic evaluation under natural prevalence variation. |
| UPP (uniform) | Multiclass | Default for multiclass. Efficient random coverage of the simplex. |
| UPP (kraemer) | Multiclass | You need a deterministic grid equivalent to APP for multiclass. |
| PPP | Binary / multiclass | You want to evaluate at specific, hand-picked prevalences. |
Tip
For most workflows, reach for apply_protocol rather than writing the
loop by hand — it accepts the same protocol choice and returns the scores
directly.
Tip
Always fix random_state in protocols when comparing methods so that
all quantifiers are evaluated on exactly the same test samples.
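The effect of a fixed seed can be illustrated with Python's own random number generator. This sketch assumes that a protocol's random_state plays the same role as the seed below. The helper sample_batches is hypothetical and does not reproduce any protocol's actual sampling.

```python
import random


def sample_batches(n_items, batch_size, n_batches, seed):
    """Draw `n_batches` index lists from a pool of `n_items` with a fixed seed."""
    rng = random.Random(seed)
    return [rng.sample(range(n_items), batch_size) for _ in range(n_batches)]


run_a = sample_batches(1000, batch_size=100, n_batches=3, seed=42)
run_b = sample_batches(1000, batch_size=100, n_batches=3, seed=42)
run_c = sample_batches(1000, batch_size=100, n_batches=3, seed=7)

print("same seed, identical batches:", run_a == run_b)
print("different seed, identical batches:", run_a == run_c)
```

Identical batches make the comparison paired: differences in error between two quantifiers then reflect the methods rather than the luck of the draw.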
See also
Quantification Foundations for a conceptual overview of why
protocols are necessary. Model Selection and Evaluation for hyperparameter
tuning with GridSearchQ.
5.1.7. References
Forman, G. (2008). Quantifying Counts and Costs via Classification. Data Mining and Knowledge Discovery, 17(2), 164–206.
González, P., Castaño, A., Chawla, N. V., & del Coz, J. J. (2017). A Review on Quantification Learning. ACM Computing Surveys, 50(5), 1–40.
Esuli, A., Fabris, A., Moreo, A., & Sebastiani, F. (2023). Learning to Quantify. The Information Retrieval Series, Springer.