APP

class mlquantify.model_selection.APP(batch_size, n_prevalences, repeats=1, random_state=None, min_prev=0.0, max_prev=1.0, strategy='grid', dirichlet_alpha=1.0)

Artificial Prevalence Protocol (APP).

Generates evaluation batches with artificially imposed prevalences drawn from the probability simplex within [min_prev, max_prev], covering a range of prevalence levels for comprehensive evaluation. The way the prevalence points are produced is controlled by the strategy parameter.
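To make "artificially imposed prevalences" concrete, the following is a minimal NumPy-only sketch of drawing one batch whose class proportions match a target prevalence vector. The helper name sample_batch_indices, the rounding rule, and sampling with replacement are assumptions made for illustration; they are not taken from the mlquantify source.

```python
import numpy as np

def sample_batch_indices(y, prevalence, batch_size, rng):
    """Hypothetical helper: draw indices so class counts match `prevalence`."""
    classes = np.unique(y)
    # Turn prevalences into integer counts; rounding leftovers go to the largest class.
    counts = np.floor(np.asarray(prevalence) * batch_size).astype(int)
    counts[np.argmax(prevalence)] += batch_size - counts.sum()
    parts = []
    for cls, n in zip(classes, counts):
        pool = np.flatnonzero(y == cls)
        # Sample with replacement so a small class can still fill its quota (assumption).
        parts.append(rng.choice(pool, size=n, replace=True))
    return np.concatenate(parts)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
idx = sample_batch_indices(y, prevalence=[0.2, 0.8], batch_size=50, rng=rng)
# Observed class proportions inside the sampled batch.
batch_prevalence = np.bincount(y[idx], minlength=2) / len(idx)
```

Repeating this for every prevalence vector produced by the chosen strategy gives the full set of evaluation batches.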

Parameters:
batch_size : int or list of int

Size(s) of the evaluation batches.

n_prevalences : int

Number of prevalence points. For strategy='grid' this is the number of grid points per class dimension; for the sampling strategies it is the number of prevalence vectors drawn from the simplex.

repeats : int, default=1

Number of repetitions for each prevalence combination.

random_state : int or None, default=None

Random seed for reproducibility.

min_prev : float, default=0.0

Minimum class prevalence.

max_prev : float, default=1.0

Maximum class prevalence.

strategy : {'grid', 'kraemer', 'uniform', 'dirichlet'}, default='grid'

How prevalence vectors are generated over the simplex:

  • 'grid': a regular lattice of evenly spaced prevalences from min_prev to max_prev (the classic APP). Deterministic, with systematic coverage, but the number of points grows combinatorially (\(O(n^{k-1})\) for n grid points per class dimension and k classes), so it is best suited to binary or low-class-count problems.

  • 'kraemer': the Kraemer method for uniform sampling over the simplex. Every prevalence vector is equally likely, and the number of points is set by n_prevalences rather than by the number of classes, making it well suited to multiclass problems where a grid would explode.

  • 'uniform': uniform sampling over the simplex via the flat Dirichlet distribution \(\mathrm{Dir}(\mathbf{1})\). It yields the same uniform distribution as 'kraemer', generated through the Dirichlet route, and is exactly 'dirichlet' with dirichlet_alpha=1.

  • 'dirichlet': sampling from a Dirichlet distribution whose concentration is set by dirichlet_alpha. Use this to bias the sampling: dirichlet_alpha > 1 favours balanced prevalences near the centre of the simplex, while dirichlet_alpha < 1 favours extreme, one-class-dominant prevalences near the corners.
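The two uniform sampling routes can be sketched with NumPy alone. This is an illustration of the general techniques (sorted uniform cut points for the Kraemer method, and a flat Dirichlet draw), not the library's own implementation; kraemer_sample is a hypothetical helper, and the min_prev/max_prev bounds are ignored here.

```python
import numpy as np

def kraemer_sample(n_classes, size, rng):
    """Uniform simplex samples via sorted uniform cut points (sketch)."""
    cuts = np.sort(rng.random((size, n_classes - 1)), axis=1)
    # Pad with 0 and 1, then take the gaps between consecutive cut points.
    edges = np.hstack([np.zeros((size, 1)), cuts, np.ones((size, 1))])
    return np.diff(edges, axis=1)

rng = np.random.default_rng(0)
kraemer = kraemer_sample(n_classes=3, size=1000, rng=rng)
# Flat Dirichlet: all concentration parameters equal to 1.
flat_dirichlet = rng.dirichlet(np.ones(3), size=1000)
```

Both arrays hold 1000 prevalence vectors that are non-negative and sum to 1, and each class has an average prevalence close to 1/3.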

dirichlet_alpha : float or array-like, default=1.0

Concentration parameter used when strategy='dirichlet'. A scalar is broadcast to a symmetric Dirichlet over all classes; an array of length n_classes sets a per-class concentration. Ignored by the other strategies ('uniform' always uses 1.0).
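The effect of the concentration parameter can be checked directly with NumPy's Dirichlet sampler. The sketch below assumes three classes and uses a hypothetical helper, mean_max_prevalence, that reports how dominant the largest class is on average; the scalar broadcasting mirrors the behaviour described above but is written here by hand.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 3

def mean_max_prevalence(alpha, size=2000):
    # A scalar alpha becomes a symmetric concentration vector; an array is used as given.
    alpha_vec = np.ones(n_classes) * alpha
    samples = rng.dirichlet(alpha_vec, size=size)
    # Average share of the largest class: high near the corners, low near the centre.
    return samples.max(axis=1).mean()

peaked = mean_max_prevalence(0.2)    # alpha < 1: one class tends to dominate
flat = mean_max_prevalence(1.0)      # alpha = 1: uniform over the simplex
balanced = mean_max_prevalence(5.0)  # alpha > 1: prevalences cluster near 1/3 each

# Per-class concentrations skew the average prevalence towards the first class.
skewed = rng.dirichlet([8.0, 1.0, 1.0], size=2000).mean(axis=0)
```

In this sketch peaked > flat > balanced, and the per-class vector [8, 1, 1] yields an average prevalence of roughly 0.8 for the first class.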

Attributes:
n_combinations : int

Total number of batches generated.

Notes

For multiclass problems the 'grid' strategy grows combinatorially; prefer a sampling strategy (or UPP) for large class counts.
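The growth can be estimated by counting lattice points on the simplex. The sketch below assumes the full range (min_prev=0, max_prev=1) and a grid made of all prevalence vectors whose entries are multiples of 1/(n-1) and sum to 1; the exact count produced by the library may differ, and both function names are hypothetical.

```python
from itertools import product
from math import comb

def grid_size(n_points, n_classes):
    """Closed-form count of lattice points: C(n_points + n_classes - 2, n_classes - 1)."""
    return comb(n_points + n_classes - 2, n_classes - 1)

def grid_size_brute_force(n_points, n_classes):
    # Enumerate integer steps per class and keep those that use up the full mass.
    steps = range(n_points)
    return sum(1 for c in product(steps, repeat=n_classes) if sum(c) == n_points - 1)

# 21 points per dimension, i.e. prevalence steps of 0.05.
sizes = {k: grid_size(21, k) for k in (2, 3, 5, 10)}
```

Under these assumptions the binary case needs 21 prevalence points and three classes need 231, while ten classes need about ten million, which is why a sampling strategy is the practical choice for large class counts.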

References

[1] Forman, G. (2008). Quantifying Counts and Costs via Classification. Data Mining and Knowledge Discovery, 17(2), 164–206.

[2] Sebastiani, F., et al. (2020). A Critical Reassessment of the Evaluation of Machine Learning Approaches for Quantification. arXiv preprint.

Examples

>>> from mlquantify.model_selection import APP
>>> import numpy as np
>>> X, y = np.random.randn(200, 5), np.random.randint(0, 2, 200)
>>> proto = APP(batch_size=50, n_prevalences=5, random_state=0)
>>> batches = list(proto.split(X, y))
>>> len(batches)
5

get_metadata_routing()

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_n_combinations()

Get the number of combinations for the current protocol.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

save_quantifier(path: str | None = None) -> None

Save the quantifier instance to a file.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

split(X: ndarray, y: ndarray)

Split the data into samples for evaluation.

Parameters:
X : np.ndarray

The input features.

y : np.ndarray

The target labels.

Yields:
Generator[np.ndarray, np.ndarray]

A generator that yields the indices for each split.