EnsembleQ
- class mlquantify.meta.EnsembleQ(quantifier, size=50, min_prop=0.1, max_prop=1, selection_metric='all', protocol='uniform', p_metric=0.25, return_type='mean', max_sample_size=None, max_trials=100, n_jobs=1, verbose=False)
Ensemble-based quantifier with prevalence-controlled diversity.
Constructs an ensemble of base quantifiers, each trained on a subsample of the training data drawn according to a prevalence-sampling protocol. At prediction time the ensemble members’ estimates are aggregated, with optional selection of the most relevant subset.
- Parameters:
- quantifier : BaseQuantifier
The base quantifier used for each ensemble member.
- size : int, default=50
Number of ensemble members to train.
- min_prop : float, default=0.1
Minimum class prevalence proportion for sampling batches.
- max_prop : float, default=1.0
Maximum class prevalence proportion for sampling batches.
- selection_metric : {'all', 'ptr', 'ds'}, default='all'
Member selection strategy. 'all' uses every member; 'ptr' selects members whose training prevalence is closest to the initial test estimate; 'ds' selects members whose training score distribution is closest to the test distribution.
- p_metric : float, default=0.25
Fraction of ensemble members to retain when applying a selection metric.
- protocol : {'artificial', 'natural', 'uniform', 'kraemer'}, default='uniform'
Prevalence-sampling protocol for generating training batches.
- return_type : {'mean', 'median'}, default='mean'
Aggregation function applied to the selected member estimates.
- max_sample_size : int or None, default=None
Maximum samples per training batch; None uses the full dataset.
- max_trials : int, default=100
Maximum sampling attempts per batch.
- n_jobs : int, default=1
Number of parallel jobs for training.
- verbose : bool, default=False
Print progress messages.
- Attributes:
- models : list
Fitted ensemble member quantifiers.
- train_prevalences : list
Training prevalences for each ensemble member.
- classes : ndarray of shape (n_classes,)
Class labels seen during fit.
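The prevalence-controlled sampling behind the ensemble's diversity can be illustrated with a minimal NumPy sketch. This is not mlquantify's implementation: the helper name `sample_batch_indices`, the restriction to binary 0/1 labels, sampling with replacement, and drawing the target prevalence uniformly from `[min_prop, max_prop]` are all assumptions made for illustration.

```python
import numpy as np

def sample_batch_indices(y, target_prev, batch_size, rng):
    """Return indices of a batch whose positive-class share is close to target_prev.

    Hypothetical helper for binary labels (0/1); samples with replacement.
    """
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_pos = int(round(target_prev * batch_size))
    n_neg = batch_size - n_pos
    chosen = np.concatenate([
        rng.choice(pos_idx, size=n_pos, replace=True),
        rng.choice(neg_idx, size=n_neg, replace=True),
    ])
    rng.shuffle(chosen)
    return chosen

rng = np.random.default_rng(0)
y = np.array([0] * 150 + [1] * 150)

# One target prevalence per ensemble member, drawn uniformly (an assumption).
min_prop, max_prop, size = 0.1, 1.0, 5
target_prevs = rng.uniform(min_prop, max_prop, size=size)
batches = [sample_batch_indices(y, p, batch_size=100, rng=rng) for p in target_prevs]

# Realised positive-class prevalence of each batch.
batch_prevs = [float(y[idx].mean()) for idx in batches]
```

Each entry of `batches` would then be used to train one ensemble member, and the matching entry of `batch_prevs` plays the role of that member's recorded training prevalence.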
References
[1] Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using Ensembles for Problems with Characterizable Changes in Data Distribution: A Case Study on Quantification. Information Fusion, 34, 87–100.
[2] Pérez-Gállego, P., Castaño, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic Ensemble Selection for Quantification Tasks. Information Fusion, 45, 1–15.
Examples
>>> from mlquantify.meta import EnsembleQ
>>> from mlquantify.matching import DyS
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=300, random_state=42)
>>> q = EnsembleQ(DyS(estimator=LogisticRegression()), size=10).fit(X, y)
>>> q.predict(X)
{0: 0.49, 1: 0.51}
- ds_get_posteriors(X, y)
Compute cross-validated posterior probabilities for the DS selection metric.
Fits a logistic regression classifier with hyperparameters tuned by grid search and returns out-of-fold posterior probabilities for the training data together with a callable for new instances.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Training feature matrix.
- y : array-like of shape (n_samples,)
Training class labels.
- Returns:
- posteriors : ndarray of shape (n_samples, n_classes)
Out-of-fold posterior probabilities for the training data.
- posteriors_generator : callable
The predict_proba method of the best-fitted estimator, used to generate posteriors for unseen test data during predict.
Notes
Cross-validated posteriors ensure that no training sample is scored by a model trained on it, preventing over-optimistic score distributions. A separate logistic regression is used regardless of the base quantifier so that soft scores are always available for the DS metric.
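The idea in the note above can be sketched with plain scikit-learn. This is an illustration, not the code inside `ds_get_posteriors`: the hyperparameter grid, the number of folds, and `max_iter` are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict

X, y = make_classification(n_samples=300, random_state=42)

# Tune the regularisation strength; this grid and cv=5 are illustrative assumptions.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)

# Out-of-fold posteriors: each row is scored by a model that never saw it.
posteriors = cross_val_predict(
    search.best_estimator_, X, y, cv=5, method="predict_proba"
)

# Callable for unseen data, taken from the estimator refit on all training data.
posteriors_generator = search.best_estimator_.predict_proba
```

Here `posteriors` has one row per training sample and one column per class, while `posteriors_generator` can be applied to a test matrix later.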
- ds_selection_metric(X, prevalences, train_distributions, posteriors_generator)
Select members whose training score distribution is closest to the test distribution.
Computes posterior-probability histograms for the test data and retains the top p_metric fraction of ensemble members, ranked by Hellinger distance between their stored training histogram and the test histogram.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Test feature matrix used to compute the test score distribution.
- prevalences : ndarray of shape (n_members, n_classes)
Prevalence estimates from each ensemble member.
- train_distributions : list of ndarray
Posterior-probability histograms stored for each member during fit.
- posteriors_generator : callable
Function that returns posterior probabilities for new data (obtained from ds_get_posteriors during fit).
- Returns:
- selected : list of ndarray
Prevalence estimates from the selected subset of members.
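The ranking step can be sketched in NumPy as follows. The function names, the toy three-bin histograms, and the rule of always keeping at least one member are assumptions for illustration; the library's binning and tie-breaking are not described on this page.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions that each sum to 1."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def select_by_distribution(prevalences, train_hists, test_hist, p_metric):
    """Keep the p_metric fraction of members whose histogram is nearest test_hist."""
    distances = np.array([hellinger(h, test_hist) for h in train_hists])
    n_keep = max(1, int(round(p_metric * len(train_hists))))  # keep at least one
    order = np.argsort(distances)[:n_keep]
    return prevalences[order]

# Toy normalised score histograms, one row per ensemble member.
train_hists = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
])
# Each member's prevalence estimate for the test set.
prevalences = np.array([[0.8, 0.2], [0.2, 0.8], [0.5, 0.5], [0.4, 0.6]])
test_hist = np.array([0.15, 0.25, 0.6])

selected = select_by_distribution(prevalences, train_hists, test_hist, p_metric=0.5)
```

With `p_metric=0.5` and four members, the two members whose histograms lie nearest the test histogram are retained.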
- fit(X, y)
Fit the ensemble by training one base quantifier per sampled batch.
Batches are drawn from (X, y) according to the chosen protocol so that each member is trained on a subset with a different class prevalence distribution, promoting diversity. When selection_metric is 'ds', posterior probabilities are precomputed for later use during prediction.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Training feature matrix.
- y : array-like of shape (n_samples,)
Training class labels.
- Returns:
- self : EnsembleQ
The fitted ensemble quantifier.
- Raises:
- ValueError
If selection_metric='ds' is used on a multiclass dataset.
Examples
>>> from mlquantify.meta import EnsembleQ
>>> from mlquantify.matching import DyS
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=300, random_state=42)
>>> q = EnsembleQ(DyS(estimator=LogisticRegression()), size=10).fit(X, y)
- get_metadata_routing()
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- predict(X)
Predict class prevalences by aggregating the ensemble members' estimates.
Each fitted member produces a prevalence estimate; if a selection metric ('ptr' or 'ds') was configured, only the most relevant members are retained before computing the final mean or median.
- Parameters:
- X : array-like of shape (n_samples, n_features)
Test feature matrix.
- Returns:
- prevalences : dict or ndarray of shape (n_classes,)
Estimated class prevalences, aggregated across the selected ensemble members.
Examples
>>> from mlquantify.meta import EnsembleQ
>>> from mlquantify.matching import DyS
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=300, random_state=42)
>>> q = EnsembleQ(DyS(estimator=LogisticRegression()), size=10).fit(X, y)
>>> q.predict(X)
{0: 0.49, 1: 0.51}
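The difference between the two `return_type` options can be seen on a small hand-made array of member estimates. The values below are invented for illustration and the code shows only the aggregation step, not the library's internals.

```python
import numpy as np

# Prevalence estimates from three (hypothetical) selected members, two classes.
selected = np.array([
    [0.40, 0.60],
    [0.45, 0.55],
    [0.80, 0.20],  # an outlying member
])

mean_estimate = selected.mean(axis=0)          # pulled toward the outlier
median_estimate = np.median(selected, axis=0)  # less affected by the outlier
```

The mean works out to roughly `[0.55, 0.45]` and the median to `[0.45, 0.55]`, so a single outlying member shifts the mean noticeably more. With more than two classes, class-wise medians need not sum to one; whether EnsembleQ renormalises the result is not stated on this page.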
- ptr_selection_metric(prevalences, train_prevalences)
Select members whose training prevalence is closest to the test estimate.
Computes an initial test-prevalence estimate by averaging all member predictions, then retains the top p_metric fraction of members ranked by how closely their training prevalence matches that estimate.
- Parameters:
- prevalences : ndarray of shape (n_members, n_classes)
Prevalence estimates from each ensemble member.
- train_prevalences : list of dict or ndarray
Training prevalences recorded for each ensemble member during fit.
- Returns:
- selected : list of ndarray
Prevalence estimates from the selected subset of members.
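A minimal NumPy sketch of this selection rule follows. The function name and the use of the L1 distance between prevalence vectors are assumptions; the page does not say which closeness measure the library uses, and training prevalences are taken here as plain arrays rather than dicts.

```python
import numpy as np

def select_by_training_prevalence(prevalences, train_prevalences, p_metric):
    """Keep the members whose training prevalence is nearest the averaged estimate."""
    prevalences = np.asarray(prevalences, dtype=float)
    train_prevalences = np.asarray(train_prevalences, dtype=float)
    # Initial test-prevalence estimate: average of all member predictions.
    initial_estimate = prevalences.mean(axis=0)
    # L1 distance is an assumed closeness measure.
    distances = np.abs(train_prevalences - initial_estimate).sum(axis=1)
    n_keep = max(1, int(round(p_metric * len(prevalences))))  # keep at least one
    order = np.argsort(distances)[:n_keep]
    return prevalences[order]

# Each member's prediction for the test set and its own training prevalence.
prevalences = np.array([[0.30, 0.70], [0.40, 0.60], [0.35, 0.65], [0.55, 0.45]])
train_prevalences = np.array([[0.10, 0.90], [0.40, 0.60], [0.50, 0.50], [0.90, 0.10]])

selected = select_by_training_prevalence(prevalences, train_prevalences, p_metric=0.25)
```

In this toy case the averaged estimate is about `[0.40, 0.60]`, so with `p_metric=0.25` only the second member, trained at exactly that prevalence, is kept.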
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
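The `<component>__<parameter>` convention is inherited from scikit-learn and can be demonstrated on a scikit-learn Pipeline. The step names `scale` and `clf` are arbitrary choices for this example. For EnsembleQ the analogous keys would presumably begin with `quantifier__`, since the constructor argument is named `quantifier`, but that is an inference from the convention rather than something this page confirms.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# <component>__<parameter> reaches into the nested estimator.
pipe.set_params(clf__C=0.5, scale__with_mean=False)

# deep=True also lists the parameters of the contained estimators.
params = pipe.get_params(deep=True)
```

After the call, `params["clf__C"]` holds the updated value and the nested LogisticRegression instance itself has been modified in place.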