apply_protocol

mlquantify.model_selection.apply_protocol(quantifier, X, y, protocol='app', *, scoring='mae', batch_size=100, n_prevalences=11, repeats=1, fit=True, test_size=0.4, stratify=True, n_jobs=1, random_state=None, return_predictions=True, return_estimator=False, verbose=0, **protocol_params)

Evaluate a quantifier across an evaluation protocol.

The protocol analogue of scikit-learn’s cross_validate: it fits the quantifier, generates many test samples with controlled prevalences using a sampling protocol (APP, NPP, UPP or PPP), predicts the prevalence of each sample, and returns the true and predicted prevalences together with one score array per metric. This packages the standard quantification evaluation loop into a single call.

Parameters:
quantifier : BaseQuantifier

The quantifier to evaluate. When fit=True a copy is trained, so the passed object is left untouched; when fit=False it must already be fitted.

X : array-like of shape (n_samples, n_features)

Feature matrix.

y : array-like of shape (n_samples,)

Class labels.

protocol : {'app', 'npp', 'upp', 'ppp'} or BaseProtocol, default='app'

Sampling protocol used to build the evaluation samples.

  • 'app' : Artificial Prevalence Protocol (grid of prevalences).

  • 'npp' : Natural Prevalence Protocol (random natural samples).

  • 'upp' : Uniform Prevalence Protocol (uniform over the simplex).

  • 'ppp' : Personalized protocol (requires prevalences=...).

A pre-built BaseProtocol instance may also be passed, in which case batch_size, n_prevalences and repeats are ignored.

scoring : str, callable, or list, default='mae'

Metric(s) used to score each sample. A metric name (e.g. 'mae', 'nmd'), a callable metric(true, pred) -> float, or a list mixing the two. Each becomes a key in the returned dictionary. Relative metrics such as 'rae' / 'nrae' are automatically smoothed with eps = 1 / (2 * sample_size) so classes absent from a sample (zero true prevalence) do not yield inf.
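As a rough illustration of the callable form metric(true, pred) -> float and of the smoothing constant described above, the sketch below defines a hypothetical custom metric (max_abs_error) and a hypothetical smoothed relative error. The additive smoothing formula used here is the one common in the quantification literature; whether mlquantify applies exactly this form internally is an assumption, as is the key under which a callable's scores appear in the returned dictionary.

```python
import numpy as np


def max_abs_error(true, pred):
    """Hypothetical custom metric: largest per-class absolute difference."""
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.max(np.abs(true - pred)))


def smoothing_eps(sample_size):
    """eps = 1 / (2 * sample_size), the constant quoted in the text."""
    return 1.0 / (2.0 * sample_size)


def smooth(prevalence, eps):
    # ASSUMPTION: additive smoothing as commonly used in the literature;
    # the library's exact formula is not stated in this page.
    prevalence = np.asarray(prevalence, dtype=float)
    return (prevalence + eps) / (1.0 + eps * prevalence.size)


def smoothed_relative_error(true, pred, eps):
    """Mean relative absolute error computed on smoothed prevalences."""
    true_s, pred_s = smooth(true, eps), smooth(pred, eps)
    return float(np.mean(np.abs(true_s - pred_s) / true_s))


eps = smoothing_eps(100)  # batch_size=100 gives eps = 0.005
# A class with zero true prevalence no longer produces a division by zero.
score = smoothed_relative_error([0.0, 1.0], [0.1, 0.9], eps)
print(eps, np.isfinite(score))
```

A function like max_abs_error could then presumably be mixed with metric names, e.g. scoring=['mae', max_abs_error].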

batch_size : int or list of int, default=100

Size of each evaluation sample.

n_prevalences : int, default=11

Number of prevalence points (APP/UPP) or natural samples (NPP).

repeats : int, default=1

Number of repetitions per prevalence point.
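To make the interaction of n_prevalences and repeats concrete for the binary APP case, the following sketch builds an evenly spaced prevalence grid with NumPy. The helper binary_prevalence_grid is hypothetical, and both the column order and the way repetitions are laid out are assumptions; the library's own grid construction (especially for more than two classes) may differ.

```python
import numpy as np


def binary_prevalence_grid(n_prevalences=11, repeats=1):
    """Hypothetical helper: evenly spaced binary prevalence vectors.

    Each row is (negative-class prevalence, positive-class prevalence).
    """
    positive = np.linspace(0.0, 1.0, n_prevalences)
    grid = np.column_stack([1.0 - positive, positive])
    # ASSUMPTION: each grid point is simply repeated `repeats` times.
    return np.repeat(grid, repeats, axis=0)


grid = binary_prevalence_grid(n_prevalences=11)
print(grid.shape)  # (11, 2)
print(grid[:3])
```

With n_prevalences=11 the positive-class prevalence steps through 0.0, 0.1, ..., 1.0, which is consistent with the (11, 2) shape shown in the Examples section.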

fit : bool, default=True

If True, train a copy of quantifier before evaluating; if False, use the already-fitted quantifier directly.

test_size : float or int or None, default=0.4

When fit=True, the held-out fraction (or count) the protocol samples from; the quantifier is trained on the complement. None or 0 trains and evaluates on the same data (in-sample).

stratify : bool, default=True

Whether the train/evaluation split is stratified by y.
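The split governed by test_size and stratify can be pictured with scikit-learn's train_test_split. This is only a sketch of the described behaviour; whether apply_protocol calls train_test_split internally is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)

# test_size=0.4 holds out 40% of the rows for the protocol to sample from;
# stratify=y keeps the class proportions similar in both parts.
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

# The quantifier would be trained on (X_train, y_train) and the protocol
# would draw its evaluation samples from (X_eval, y_eval).
print(len(y_train), len(y_eval))
```

With test_size=None or 0, no such split would take place and the same data would serve both purposes, so the resulting scores are in-sample.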

n_jobs : int or None, default=1

Number of parallel jobs over the protocol samples.

random_state : int, RandomState instance, or None, default=None

Seed controlling the split and the protocol sampling.

return_predictions : bool, default=True

Whether to include the true and predicted prevalence arrays.

return_estimator : bool, default=False

Whether to include the fitted quantifier under the 'estimator' key.

verbose : int, default=0

Verbosity level, following the sklearn convention:

  • 0 (or False) — silent.

  • 1 (or True) — print a one-line summary: protocol name, number of batches, and the mean score for each metric.

  • 2 — additionally print one line per sample showing its index, true prevalence, predicted prevalence, and per-metric scores.

**protocol_params : dict

Extra protocol arguments forwarded to the constructor (e.g. min_prev, max_prev, strategy and dirichlet_alpha for APP/UPP, prevalences for PPP).

Returns:
results : dict

Dictionary with the keys:

  • 'true_prevalences' : ndarray of shape (n_samples, n_classes), present when return_predictions=True. Here n_samples counts the evaluation samples (the value of 'n_batches'), not the rows of X.

  • 'predicted_prevalences' : ndarray of the same shape.

  • 'n_batches' : int, the number of evaluation samples.

  • one key per metric (e.g. 'MAE') : ndarray of shape (n_samples,) holding the per-sample score.

  • 'estimator' : the fitted quantifier, when return_estimator=True.

See also

APP, NPP, UPP, PPP

The sampling protocols this runs.

GridSearchQ

Hyper-parameter search using the same protocols.

Notes

Aggregate a run with results['MAE'].mean(). The true and predicted prevalence arrays are convenient for diagonal “true vs predicted” diagnostic plots.
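A minimal sketch of this aggregation, using a hand-built dictionary as a stand-in for the real return value (the numbers are made up for illustration, and summarize is a hypothetical helper):

```python
import numpy as np

# Hypothetical stand-in for the dictionary returned by apply_protocol.
results = {
    "true_prevalences": np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]),
    "predicted_prevalences": np.array([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]),
    "n_batches": 3,
    "MAE": np.array([0.1, 0.05, 0.2]),
}


def summarize(results, metric_keys):
    """Return {metric: (mean, std)} over the per-sample scores."""
    return {
        key: (float(np.mean(results[key])), float(np.std(results[key])))
        for key in metric_keys
    }


summary = summarize(results, ["MAE"])

# Signed distance from the diagonal of a "true vs predicted" plot for the
# second class: positive values mean the prevalence was overestimated.
residuals = (
    results["predicted_prevalences"][:, 1] - results["true_prevalences"][:, 1]
)
print(summary)
print(residuals)
```

Plotting the second column of each array against the other gives the diagonal diagnostic; the residuals are the vertical distances of the points from that diagonal.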

Examples

>>> from mlquantify.model_selection import apply_protocol
>>> from mlquantify.counting import CC
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=400, random_state=0)
>>> results = apply_protocol(
...     CC(LogisticRegression()), X, y,
...     protocol="app", n_prevalences=11, batch_size=100,
...     scoring=["mae", "nmd"], random_state=0,
... )
>>> results["true_prevalences"].shape
(11, 2)
>>> round(float(results["MAE"].mean()), 2)