EnsembleQ
- class mlquantify.meta.EnsembleQ(quantifier, size=50, min_prop=0.1, max_prop=1, selection_metric='all', protocol='uniform', p_metric=0.25, return_type='mean', max_sample_size=None, max_trials=100, n_jobs=1, verbose=False)
Ensemble Quantifier with prevalence-controlled diversity.
Targets prior probability shift, including shifts whose magnitude is unknown at training time. Trains many copies of a base quantifier on subsamples drawn at deliberately different class prevalences, then aggregates their estimates, optionally keeping only the members whose training distribution resembles the test sample (dynamic selection). The spread of training prevalences brackets the possible test distributions.
- Parameters:
- quantifierBaseQuantifier
The base quantifier replicated across ensemble members.
- sizeint, default=50
Number of ensemble members to train.
- min_propfloat, default=0.1
Minimum class prevalence proportion for sampling batches.
- max_propfloat, default=1.0
Maximum class prevalence proportion for sampling batches; together with min_prop it sets the diversity of training prevalences.
- selection_metric{'all', 'ptr', 'ds'}, default='all'
Which members vote at prediction time.
'all': use every member (a plain bagged average).
'ptr': keep members whose training prevalence is closest to an initial test estimate.
'ds': keep members whose training score distribution is closest to the test distribution.
- p_metricfloat, default=0.25
Fraction of ensemble members to retain when a selection metric is used.
- protocol{'artificial', 'natural', 'uniform', 'kraemer'}, default='uniform'
Prevalence-sampling protocol for generating training batches.
- return_type{'mean', 'median'}, default='mean'
Aggregation function applied to the selected member estimates.
- max_sample_sizeint or None, default=None
Maximum samples per training batch; None uses the full dataset.
- max_trialsint, default=100
Maximum sampling attempts per batch.
- n_jobsint, default=1
Number of parallel jobs for training.
- verbosebool, default=False
Print progress messages.
- Attributes:
- modelslist
Fitted ensemble member quantifiers.
- train_prevalenceslist
Training prevalences for each ensemble member.
- classesndarray of shape (n_classes,)
Class labels seen during fit.
See also
AggregativeBootstrap: Resampling wrapper for confidence regions.
QuaDapt: Drift-resilient adaptation wrapper.
Notes
Members are trained with sampling-with-replacement so that p(x|y) is preserved while only p(y) varies. Dynamic selection ('ptr'/'ds') is what specialises the ensemble to each test bag; with 'all' it reduces to a bagged average. Wraps any base quantifier.
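The sampling idea in this note can be illustrated with a minimal sketch for the binary case. It is not the library's implementation: the helper name `sample_at_prevalence` and its arguments are hypothetical, and only the standard library is used. Drawing with replacement inside each class keeps p(x|y) intact, while the requested class mix changes p(y).

```python
import random


def sample_at_prevalence(labels, positive_prevalence, sample_size, rng):
    """Return indices drawn with replacement so that class 1 makes up
    `positive_prevalence` of the sample (hypothetical helper, binary case)."""
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    n_pos = round(positive_prevalence * sample_size)
    n_neg = sample_size - n_pos
    # Sampling inside each class leaves p(x|y) untouched; only the class mix changes.
    chosen = rng.choices(pos_idx, k=n_pos) + rng.choices(neg_idx, k=n_neg)
    rng.shuffle(chosen)
    return chosen


rng = random.Random(0)
labels = [0] * 70 + [1] * 30  # training set with 30% positives
idx = sample_at_prevalence(labels, positive_prevalence=0.8, sample_size=50, rng=rng)
sampled_prevalence = sum(labels[i] for i in idx) / len(idx)
print(sampled_prevalence)  # 0.8
```

Repeating this draw at many different target prevalences gives each ensemble member its own training prior.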
References
[1]Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using Ensembles for Problems with Characterizable Changes in Data Distribution: A Case Study on Quantification. Information Fusion, 34, 87–100.
[2]Pérez-Gállego, P., Castaño, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic Ensemble Selection for Quantification Tasks. Information Fusion, 45, 1–15.
Examples
>>> from mlquantify.meta import EnsembleQ
>>> from mlquantify.matching import DyS
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=300, random_state=42)
>>> q = EnsembleQ(DyS(estimator=LogisticRegression()), size=10).fit(X, y)
>>> q.predict(X)
{0: ..., 1: ...}
- ds_get_posteriors(X, y)
Compute cross-validated posterior probabilities for the DS selection metric.
Fits a logistic regression classifier with hyperparameters tuned by grid-search and returns out-of-fold posterior probabilities for the training data together with a callable for new instances.
- Parameters:
- Xarray-like of shape (n_samples, n_features)
Training feature matrix.
- yarray-like of shape (n_samples,)
Training class labels.
- Returns:
- posteriorsndarray of shape (n_samples, n_classes)
Out-of-fold posterior probabilities for the training data.
- posteriors_generatorcallable
predict_proba method of the best-fitted estimator, used to generate posteriors for unseen test data during predict.
Notes
Cross-validated posteriors ensure that no training sample is scored by a model trained on it, preventing over-optimistic score distributions. A separate logistic regression is used regardless of the base quantifier so that soft scores are always available for the DS metric.
- ds_selection_metric(X, prevalences, train_distributions, posteriors_generator)
Select members whose training score distribution is closest to the test distribution.
Computes posterior-probability histograms for the test data and retains the top p_metric fraction of ensemble members ranked by Hellinger distance between their stored training histogram and the test histogram.
- Parameters:
- Xarray-like of shape (n_samples, n_features)
Test feature matrix used to compute the test score distribution.
- prevalencesndarray of shape (n_members, n_classes)
Prevalence estimates from each ensemble member.
- train_distributionslist of ndarray
Posterior-probability histograms stored for each member during fit.
- posteriors_generatorcallable
Function that returns posterior probabilities for new data (obtained from ds_get_posteriors during fit).
- Returns:
- selectedlist of ndarray
Prevalence estimates from the selected subset of members.
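The ranking described for this method can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the function names are hypothetical, histograms are assumed to be normalized to sum to 1, the Hellinger distance is written with the usual 1/2 normalization (a constant factor that does not affect the ranking), and the number of retained members is assumed to be `p_metric * n_members` rounded, with at least one member kept.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions that each sum to 1."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)


def select_by_distribution(member_estimates, train_histograms, test_histogram, p_metric):
    """Keep the estimates of the members whose training histogram is closest
    to the test histogram (hypothetical helper)."""
    n_keep = max(1, round(p_metric * len(member_estimates)))
    order = sorted(
        range(len(member_estimates)),
        key=lambda i: hellinger(train_histograms[i], test_histogram),
    )
    return [member_estimates[i] for i in order[:n_keep]]


test_histogram = [0.1, 0.2, 0.3, 0.4]
train_histograms = [
    [0.4, 0.3, 0.2, 0.1],      # mirror image of the test histogram, farthest
    [0.1, 0.2, 0.3, 0.4],      # identical to the test histogram, distance 0
    [0.25, 0.25, 0.25, 0.25],  # uniform
    [0.1, 0.3, 0.3, 0.3],      # close to the test histogram
]
member_estimates = [[0.7, 0.3], [0.2, 0.8], [0.5, 0.5], [0.3, 0.7]]

selected = select_by_distribution(member_estimates, train_histograms, test_histogram, p_metric=0.5)
print(selected)  # [[0.2, 0.8], [0.3, 0.7]]
```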
- fit(X, y)
Fit the ensemble by training one base quantifier per sampled batch.
Batches are drawn from (X, y) according to the chosen protocol so that each member is trained on a subset with a different class prevalence distribution, promoting diversity. When selection_metric is 'ds', posterior probabilities are precomputed for later use during prediction.
- Parameters:
- Xarray-like of shape (n_samples, n_features)
Training feature matrix.
- yarray-like of shape (n_samples,)
Training class labels.
- Returns:
- selfEnsembleQ
The fitted ensemble quantifier.
- Raises:
- ValueError
If selection_metric='ds' is used on a multiclass dataset.
Examples
>>> from mlquantify.meta import EnsembleQ
>>> from mlquantify.matching import DyS
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=300, random_state=42)
>>> q = EnsembleQ(DyS(estimator=LogisticRegression()), size=10).fit(X, y)
- get_metadata_routing()
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
- routingMetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deepbool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- paramsdict
Parameter names mapped to their values.
- predict(X)
Predict class prevalences by aggregating the estimates of the ensemble members.
Each fitted member produces a prevalence estimate; if a selection metric ('ptr' or 'ds') was configured, only the most relevant members are retained before computing the final mean or median.
- Parameters:
- Xarray-like of shape (n_samples, n_features)
Test feature matrix.
- Returns:
- prevalencesdict or ndarray of shape (n_classes,)
Estimated class prevalences, aggregated across the selected ensemble members.
Examples
>>> from mlquantify.meta import EnsembleQ
>>> from mlquantify.matching import DyS
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=300, random_state=42)
>>> q = EnsembleQ(DyS(estimator=LogisticRegression()), size=10).fit(X, y)
>>> q.predict(X)
{0: ..., 1: ...}
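The final aggregation step of predict can be pictured with a small standard-library sketch. The helper `aggregate` is hypothetical and assumes the reduction is applied per class across the retained members; whether the library renormalizes the result afterwards is not stated here, so the sketch does not.

```python
from statistics import mean, median


def aggregate(member_estimates, return_type="mean"):
    """Combine per-member prevalence vectors class by class (hypothetical helper)."""
    reducer = {"mean": mean, "median": median}[return_type]
    n_classes = len(member_estimates[0])
    return [reducer([est[c] for est in member_estimates]) for c in range(n_classes)]


# Three members; the last one is an outlier.
member_estimates = [[0.2, 0.8], [0.3, 0.7], [0.9, 0.1]]

mean_estimate = aggregate(member_estimates, "mean")
median_estimate = aggregate(member_estimates, "median")
print(median_estimate)  # [0.3, 0.7]
```

In this toy case the median ignores the outlying member, while the mean is pulled toward it, which is the usual reason for choosing between the two.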
- ptr_selection_metric(prevalences, train_prevalences)
Select members whose training prevalence is closest to the test estimate.
Computes an initial test-prevalence estimate by averaging all member predictions, then retains the top p_metric fraction of members ranked by how closely their training prevalence matches that estimate.
- Parameters:
- prevalencesndarray of shape (n_members, n_classes)
Prevalence estimates from each ensemble member.
- train_prevalenceslist of dict or ndarray
Training prevalences recorded for each ensemble member during fit.
- Returns:
- selectedlist of ndarray
Prevalence estimates from the selected subset of members.
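A minimal sketch of this selection rule follows. It is an illustration, not the library's code: the function name is hypothetical, closeness is assumed to be the L1 distance between prevalence vectors (the text above does not name the measure), and the number of retained members is assumed to be `p_metric * n_members` rounded, with at least one member kept.

```python
def select_by_training_prevalence(member_estimates, train_prevalences, p_metric):
    """Keep the estimates of members trained at a prevalence close to the
    ensemble's initial test estimate (hypothetical helper)."""
    n_members = len(member_estimates)
    n_classes = len(member_estimates[0])
    # Initial test estimate: per-class average of all member predictions.
    initial = [sum(est[c] for est in member_estimates) / n_members for c in range(n_classes)]

    def distance(prevalence):
        # Assumed closeness measure: L1 distance to the initial estimate.
        return sum(abs(a - b) for a, b in zip(prevalence, initial))

    n_keep = max(1, round(p_metric * n_members))
    order = sorted(range(n_members), key=lambda i: distance(train_prevalences[i]))
    return [member_estimates[i] for i in order[:n_keep]]


member_estimates = [[0.9, 0.1], [0.7, 0.3], [0.6, 0.4], [0.6, 0.4]]   # averages to about [0.7, 0.3]
train_prevalences = [[0.1, 0.9], [0.5, 0.5], [0.7, 0.3], [0.9, 0.1]]

selected = select_by_training_prevalence(member_estimates, train_prevalences, p_metric=0.25)
print(selected)  # [[0.6, 0.4]]
```

Only the third member was trained near the initial estimate, so with p_metric=0.25 and four members its prediction alone is kept.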
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it is possible to update each component of a nested object.
- Parameters:
- **paramsdict
Estimator parameters.
- Returns:
- selfestimator instance
Estimator instance.
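The `<component>__<parameter>` naming can be illustrated with a short sketch of how such keys separate into a component name and a remainder. The helper `split_nested_params` is hypothetical, and the key `quantifier__estimator__C` is only an assumed example based on the constructor arguments shown on this page (`quantifier`, and `estimator` on the base quantifier).

```python
from collections import defaultdict


def split_nested_params(params):
    """Group keyword arguments by their leading component name (hypothetical helper)."""
    own, nested = {}, defaultdict(dict)
    for key, value in params.items():
        # Split at the first double underscore only; the rest is passed down as-is.
        name, delimiter, sub_key = key.partition("__")
        if delimiter:
            nested[name][sub_key] = value
        else:
            own[name] = value
    return own, dict(nested)


own, nested = split_nested_params({"size": 20, "quantifier__estimator__C": 10.0})
print(own)     # {'size': 20}
print(nested)  # {'quantifier': {'estimator__C': 10.0}}
```

Each level strips one component name, so a deeper key such as `estimator__C` is resolved by the nested object in turn.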