APP
- class mlquantify.model_selection.APP(batch_size, n_prevalences, repeats=1, random_state=None, min_prev=0.0, max_prev=1.0, strategy='grid', dirichlet_alpha=1.0)
Artificial Prevalence Protocol (APP).
Generates evaluation batches with artificially imposed prevalences drawn from the probability simplex within
[min_prev, max_prev], covering a range of prevalence levels for comprehensive evaluation. The way the prevalence points are produced is controlled by the strategy parameter.
- Parameters:
- batch_size : int or list of int
Size(s) of the evaluation batches.
- n_prevalences : int
Number of prevalence points. For strategy='grid' this is the number of grid points per class dimension; for the sampling strategies it is the number of prevalence vectors drawn from the simplex.
- repeats : int, default=1
Number of repetitions for each prevalence combination.
- random_state : int or None, default=None
Random seed for reproducibility.
- min_prev : float, default=0.0
Minimum class prevalence.
- max_prev : float, default=1.0
Maximum class prevalence.
- strategy : {'grid', 'kraemer', 'uniform', 'dirichlet'}, default='grid'
How prevalence vectors are generated over the simplex:
  - 'grid': a regular lattice of evenly spaced prevalences from min_prev to max_prev (the classic APP). Deterministic and gives systematic coverage, but the number of points grows combinatorially (\(O(n^{k-1})\) for k classes), so it is best suited to binary or low-class-count problems.
  - 'kraemer': the Kraemer method for uniform sampling over the simplex. Every prevalence combination is equally likely, and the number of points is fixed by n_prevalences rather than by the number of classes, making it well suited to multiclass problems where a grid would explode.
  - 'uniform': uniform sampling over the simplex via the flat Dirichlet distribution \(\mathrm{Dir}(\mathbf{1})\). Statistically equivalent to 'kraemer' (uniform coverage) but produced through the Dirichlet route; it is exactly 'dirichlet' with dirichlet_alpha=1.
  - 'dirichlet': sampling from a Dirichlet distribution whose concentration is set by dirichlet_alpha. Use this to bias the sampling: alpha > 1 favours balanced prevalences near the centre of the simplex, while alpha < 1 favours extreme, one-class-dominant prevalences near the corners.
- dirichlet_alpha : float or array-like, default=1.0
Concentration parameter used when strategy='dirichlet'. A scalar is broadcast to a symmetric Dirichlet over all classes; an array of length n_classes sets a per-class concentration. Ignored by the other strategies ('uniform' always uses 1.0).
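The effect of the concentration parameter can be illustrated without the library. The following is a minimal sketch, not mlquantify's implementation: it draws Dirichlet vectors by normalising Gamma variates from Python's standard library, and the helper names (broadcast_alpha, sample_dirichlet, mean_largest) are hypothetical. A larger average for the biggest component means the samples sit closer to a corner of the simplex.

```python
import random


def broadcast_alpha(alpha, n_classes):
    """Scalar -> symmetric concentration vector; sequence -> per-class values."""
    if isinstance(alpha, (int, float)):
        return [float(alpha)] * n_classes
    alpha = [float(a) for a in alpha]
    if len(alpha) != n_classes:
        raise ValueError("alpha must have one entry per class")
    return alpha


def sample_dirichlet(alpha, rng):
    """Draw one prevalence vector from Dir(alpha) by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
n_classes = 3


def mean_largest(alpha, n_draws=2000):
    """Average size of the dominant class over many sampled prevalence vectors."""
    conc = broadcast_alpha(alpha, n_classes)
    return sum(max(sample_dirichlet(conc, rng)) for _ in range(n_draws)) / n_draws


peaked = mean_largest(0.2)   # alpha < 1: mass near the corners
flat = mean_largest(1.0)     # alpha = 1: uniform over the simplex
centred = mean_largest(5.0)  # alpha > 1: mass near the centre
print(peaked, flat, centred)
```

In this sketch the dominant-class average shrinks as alpha grows, which matches the behaviour described for alpha < 1 and alpha > 1.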
- Attributes:
- n_combinations : int
Total number of batches generated.
Notes
For multiclass problems the 'grid' strategy grows combinatorially; prefer a sampling strategy (or UPP) for large class counts.
References
[1] Forman, G. (2008). Quantifying Counts and Costs via Classification. Data Mining and Knowledge Discovery, 17(2), 164–206.
[2] Sebastiani, F., et al. (2020). A Critical Reassessment of the Evaluation of Machine Learning Approaches for Quantification. arXiv preprint.
Examples
>>> from mlquantify.model_selection import APP
>>> import numpy as np
>>> X, y = np.random.randn(200, 5), np.random.randint(0, 2, 200)
>>> proto = APP(batch_size=50, n_prevalences=5, random_state=0)
>>> batches = list(proto.split(X, y))
>>> len(batches)
5
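To see why a grid grows with the number of classes while a sampling strategy does not, the sketch below enumerates a simplex lattice and draws Kraemer-style samples using only the standard library. It is an illustration under stated assumptions, not the library's code: grid_prevalences and kraemer_prevalences are hypothetical helpers, and the lattice assumes min_prev=0.0 and max_prev=1.0.

```python
import itertools
import math
import random


def grid_prevalences(n_prevalences, n_classes):
    """Enumerate lattice points on the simplex with n_prevalences levels per class."""
    steps = n_prevalences - 1  # levels are 0/steps, 1/steps, ..., steps/steps
    points = []
    # Pick integer counts for the first k-1 classes; the last class takes the rest.
    for head in itertools.product(range(steps + 1), repeat=n_classes - 1):
        rest = steps - sum(head)
        if rest >= 0:
            points.append(tuple(c / steps for c in head + (rest,)))
    return points


def kraemer_prevalences(n_samples, n_classes, rng):
    """Uniform simplex sampling: sort k-1 uniforms and take consecutive gaps."""
    out = []
    for _ in range(n_samples):
        cuts = sorted(rng.random() for _ in range(n_classes - 1))
        edges = [0.0] + cuts + [1.0]
        out.append(tuple(b - a for a, b in zip(edges, edges[1:])))
    return out


# Grid size for 5 levels per class as the number of classes increases.
grid_sizes = {k: len(grid_prevalences(5, k)) for k in (2, 3, 5, 8)}
print(grid_sizes)  # {2: 5, 3: 15, 5: 70, 8: 330}

# The lattice count matches the closed form C(steps + k - 1, k - 1).
closed_form = {k: math.comb(4 + k - 1, k - 1) for k in grid_sizes}

# A sampling strategy yields exactly the requested number of vectors.
samples = kraemer_prevalences(5, 10, random.Random(0))
print(len(samples))  # 5
```

The binary case gives 5 lattice points, in line with the five batches in the example above, while eight classes already need 330 points at the same resolution.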
- get_metadata_routing()
Get metadata routing of this object.
See the User Guide for how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
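The <component>__<parameter> naming can be made concrete with a small stand-in. The sketch below is not the mlquantify or scikit-learn implementation: ParamsMixin, ToyProtocol and ToyWrapper are hypothetical classes written only to show how a double-underscore key is routed to a nested object and how a deep get_params flattens nested parameters.

```python
class ParamsMixin:
    """Hypothetical stand-in mimicking the get_params / set_params convention."""

    _param_names = ()

    def get_params(self, deep=True):
        params = {name: getattr(self, name) for name in self._param_names}
        if deep:
            # Flatten parameters of nested objects as "<component>__<parameter>".
            for name, value in list(params.items()):
                if hasattr(value, "get_params"):
                    for sub, sub_value in value.get_params(deep=True).items():
                        params[f"{name}__{sub}"] = sub_value
        return params

    def set_params(self, **params):
        for key, value in params.items():
            name, sep, sub = key.partition("__")
            if name not in self._param_names:
                raise ValueError(f"Invalid parameter {name!r}")
            if sep:
                # Delegate "<component>__<parameter>" to the nested component.
                getattr(self, name).set_params(**{sub: value})
            else:
                setattr(self, name, value)
        return self


class ToyProtocol(ParamsMixin):
    _param_names = ("batch_size", "repeats")

    def __init__(self, batch_size=50, repeats=1):
        self.batch_size = batch_size
        self.repeats = repeats


class ToyWrapper(ParamsMixin):
    _param_names = ("protocol", "verbose")

    def __init__(self, protocol, verbose=False):
        self.protocol = protocol
        self.verbose = verbose


wrapper = ToyWrapper(ToyProtocol())
returned = wrapper.set_params(protocol__repeats=3, verbose=True)
params = wrapper.get_params()
shallow = wrapper.get_params(deep=False)
```

Because set_params returns the instance itself, calls can be chained, and the shallow dictionary contains only the wrapper's own parameters.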