EnsembleQ#

class mlquantify.meta.EnsembleQ(quantifier, size=50, min_prop=0.1, max_prop=1, selection_metric='all', protocol='uniform', p_metric=0.25, return_type='mean', max_sample_size=None, max_trials=100, n_jobs=1, verbose=False)[source]#

Ensemble Quantifier with prevalence-controlled diversity.

Targets prior probability shift, including shifts whose magnitude is unknown at training time. Trains many copies of a base quantifier on subsamples drawn at deliberately different class prevalences, then aggregates their estimates, optionally keeping only the members whose training distribution resembles the test sample (dynamic selection). The spread of training prevalences is meant to bracket the range of plausible test distributions.

Parameters:
quantifier : BaseQuantifier

The base quantifier replicated across ensemble members.

size : int, default=50

Number of ensemble members to train.

min_prop : float, default=0.1

Minimum class prevalence proportion for sampling batches.

max_prop : float, default=1.0

Maximum class prevalence proportion for sampling batches; together with min_prop it sets the diversity of training prevalences.

selection_metric : {'all', 'ptr', 'ds'}, default='all'

Which members vote at prediction time.

  • 'all' : use every member (a plain bagged average).

  • 'ptr' : keep members whose training prevalence is closest to an initial test estimate.

  • 'ds' : keep members whose training score distribution is closest to the test distribution.

p_metric : float, default=0.25

Fraction of ensemble members to retain when a selection metric is used.

protocol : {'artificial', 'natural', 'uniform', 'kraemer'}, default='uniform'

Prevalence-sampling protocol for generating training batches.

return_type : {'mean', 'median'}, default='mean'

Aggregation function applied to the selected member estimates.

max_sample_size : int or None, default=None

Maximum samples per training batch; None uses the full dataset.

max_trials : int, default=100

Maximum sampling attempts per batch.

n_jobs : int, default=1

Number of parallel jobs for training.

verbose : bool, default=False

Print progress messages.

Attributes:
models : list

Fitted ensemble member quantifiers.

train_prevalences : list

Training prevalences for each ensemble member.

classes : ndarray of shape (n_classes,)

Class labels seen during fit.

See also

AggregativeBootstrap

Resampling wrapper for confidence regions.

QuaDapt

Drift-resilient adaptation wrapper.

Notes

Members are trained on batches drawn by sampling with replacement, so that p(x|y) is preserved while only p(y) varies. Dynamic selection ('ptr'/'ds') is what specialises the ensemble to each test bag; with 'all' it reduces to a bagged average. Any base quantifier can be wrapped.
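
The resampling idea in these notes can be sketched with NumPy alone. The helper below is a hypothetical illustration, not mlquantify's sampler: `sample_at_prevalence` and its arguments are names invented here. It draws indices class by class with replacement, so the features within each class are left untouched while the class mix is set to the target.

```python
import numpy as np


def sample_at_prevalence(y, prevalence, batch_size, rng):
    """Return row indices whose class mix follows `prevalence` (hypothetical helper)."""
    classes = np.unique(y)
    # Turn the target prevalence into per-class counts; the last class
    # absorbs any rounding remainder so the counts sum to batch_size.
    counts = np.rint(np.asarray(prevalence) * batch_size).astype(int)
    counts[-1] = batch_size - counts[:-1].sum()
    parts = []
    for cls, n in zip(classes, counts):
        pool = np.flatnonzero(y == cls)
        # With replacement, a rare class can still fill a large quota.
        parts.append(rng.choice(pool, size=n, replace=True))
    return np.concatenate(parts)


rng = np.random.default_rng(0)
y = np.array([0] * 90 + [1] * 10)          # training labels: 10% positives
idx = sample_at_prevalence(y, [0.8, 0.2], batch_size=100, rng=rng)
batch_prevalence = y[idx].mean()           # share of positives in the batch
```

Here the batch contains 20% positives even though the training set has only 10%, which is the kind of shift each ensemble member is meant to see.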

References
[1]

Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using Ensembles for Problems with Characterizable Changes in Data Distribution: A Case Study on Quantification. Information Fusion, 34, 87–100.

[2]

Pérez-Gállego, P., Castaño, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic Ensemble Selection for Quantification Tasks. Information Fusion, 45, 1–15.

Examples

>>> from mlquantify.meta import EnsembleQ
>>> from mlquantify.matching import DyS
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=300, random_state=42)
>>> q = EnsembleQ(DyS(estimator=LogisticRegression()), size=10).fit(X, y)
>>> q.predict(X)
{0: ..., 1: ...}

ds_get_posteriors(X, y)[source]#

Compute cross-validated posterior probabilities for the DS selection metric.

Fits a logistic regression classifier with hyperparameters tuned by grid search and returns out-of-fold posterior probabilities for the training data together with a callable for new instances.

Parameters:
X : array-like of shape (n_samples, n_features)

Training feature matrix.

y : array-like of shape (n_samples,)

Training class labels.

Returns:
posteriors : ndarray of shape (n_samples, n_classes)

Out-of-fold posterior probabilities for the training data.

posteriors_generator : callable

predict_proba method of the best-fitted estimator, used to generate posteriors for unseen test data during predict.

Notes

Cross-validated posteriors ensure that no training sample is scored by a model trained on it, preventing over-optimistic score distributions. A separate logistic regression is used regardless of the base quantifier so that soft scores are always available for the DS metric.
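
A minimal sketch of this pattern using scikit-learn directly is shown below. It is not the method's source: the parameter grid, the number of folds and `max_iter` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict

X, y = make_classification(n_samples=300, random_state=42)

# Tune the regularisation strength by grid search (this grid is an assumption).
search = GridSearchCV(
    LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, cv=5
)
search.fit(X, y)
best = search.best_estimator_  # refit on all of (X, y) with the best C

# Out-of-fold posteriors: every row is scored by a clone that never saw it.
posteriors = cross_val_predict(best, X, y, cv=5, method="predict_proba")

# Callable kept for scoring unseen test data later on.
posteriors_generator = best.predict_proba
```

`cross_val_predict` clones the estimator for each fold, so `best` itself stays fitted on the full training set and can be reused at prediction time.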

ds_selection_metric(X, prevalences, train_distributions, posteriors_generator)[source]#

Select members whose training score distribution is closest to the test distribution.

Computes a posterior-probability histogram for the test data, ranks ensemble members by the Hellinger distance between their stored training histogram and the test histogram, and retains the p_metric fraction of members with the smallest distances.

Parameters:
X : array-like of shape (n_samples, n_features)

Test feature matrix used to compute the test score distribution.

prevalences : ndarray of shape (n_members, n_classes)

Prevalence estimates from each ensemble member.

train_distributions : list of ndarray

Posterior-probability histograms stored for each member during fit.

posteriors_generator : callable

Function that returns posterior probabilities for new data (obtained from ds_get_posteriors during fit).

Returns:
selected : list of ndarray

Prevalence estimates from the selected subset of members.
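
The selection rule can be illustrated with a small NumPy sketch. The function names, the number of bins and the rounding used to turn p_metric into a member count are assumptions made for this example, not details taken from the library.

```python
import numpy as np


def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


def score_histogram(scores, bins=8):
    """Normalised histogram of positive-class posteriors in [0, 1]."""
    counts, _ = np.histogram(scores, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()


def select_by_distribution(member_prevs, train_hists, test_hist, p_metric=0.25):
    """Keep the members whose training histogram is nearest the test histogram."""
    dists = np.array([hellinger(h, test_hist) for h in train_hists])
    n_keep = max(1, int(round(p_metric * len(train_hists))))
    keep = np.argsort(dists)[:n_keep]  # smallest distance first
    return member_prevs[keep]


# Four members with clearly different training score distributions.
train_hists = [score_histogram(np.full(10, v)) for v in (0.1, 0.3, 0.6, 0.9)]
member_prevs = np.array([[0.9, 0.1], [0.7, 0.3], [0.5, 0.5], [0.3, 0.7]])
test_hist = score_histogram(np.full(20, 0.6))

selected = select_by_distribution(member_prevs, train_hists, test_hist)
```

With p_metric=0.25 and four members, only the single closest member survives, here the one whose training scores fall in the same bin as the test scores.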

fit(X, y)[source]#

Fit the ensemble by training one base quantifier per sampled batch.

Batches are drawn from (X, y) according to the chosen protocol so that each member is trained on a subset with a different class prevalence distribution, promoting diversity. When selection_metric is 'ds', posterior probabilities are precomputed for later use during prediction.
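
As a rough illustration of how a protocol might generate target prevalences, the sketch below draws vectors uniformly from the probability simplex with the sorted-uniform construction commonly attributed to Kraemer, then rejects draws that fall outside [min_prop, max_prop], retrying up to max_trials times. Reading min_prop, max_prop and max_trials as a rejection-sampling rule is an assumption about their meaning, and the function names are hypothetical; this is not the library's code.

```python
import numpy as np


def kraemer_prevalence(n_classes, rng):
    """Draw one prevalence vector uniformly from the probability simplex."""
    cuts = np.sort(rng.uniform(size=n_classes - 1))
    # Gaps between the sorted cut points on [0, 1] are the class prevalences.
    return np.diff(np.concatenate(([0.0], cuts, [1.0])))


def draw_valid_prevalence(n_classes, min_prop, max_prop, max_trials, rng):
    """Retry until every class prevalence lies in [min_prop, max_prop]."""
    for _ in range(max_trials):
        prev = kraemer_prevalence(n_classes, rng)
        if prev.min() >= min_prop and prev.max() <= max_prop:
            return prev
    raise RuntimeError("no valid prevalence found within max_trials")


rng = np.random.default_rng(0)
prev = draw_valid_prevalence(3, min_prop=0.1, max_prop=1.0, max_trials=100, rng=rng)
```

Each accepted vector could then serve as the target class mix for one member's training batch.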

Parameters:
X : array-like of shape (n_samples, n_features)

Training feature matrix.

y : array-like of shape (n_samples,)

Training class labels.

Returns:
self : EnsembleQ

The fitted ensemble quantifier.

Raises:
ValueError

If selection_metric='ds' is used on a multiclass dataset.

Examples

>>> from mlquantify.meta import EnsembleQ
>>> from mlquantify.matching import DyS
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=300, random_state=42)
>>> q = EnsembleQ(DyS(estimator=LogisticRegression()), size=10).fit(X, y)

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

predict(X)[source]#

Predict class prevalences by aggregating the estimates of the ensemble members.

Each fitted member produces a prevalence estimate; if a selection metric ('ptr' or 'ds') was configured, only the most relevant members are retained before computing the final mean or median.

Parameters:
X : array-like of shape (n_samples, n_features)

Test feature matrix.

Returns:
prevalences : dict or ndarray of shape (n_classes,)

Estimated class prevalences, aggregated across the selected ensemble members.
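
The difference between the two aggregation choices can be seen on a toy matrix of member estimates. The numbers below are made up for illustration, and the snippet only shows the arithmetic of a per-class mean and median, not the method's internals.

```python
import numpy as np

# One row per ensemble member, one column per class (made-up values).
member_prevs = np.array([
    [0.20, 0.80],
    [0.25, 0.75],
    [0.30, 0.70],
    [0.90, 0.10],  # an outlying member
])

mean_est = member_prevs.mean(axis=0)          # pulled towards the outlier
median_est = np.median(member_prevs, axis=0)  # more robust to the outlier
```

The mean gives [0.4125, 0.5875] while the median gives [0.275, 0.725], so a single outlying member moves the mean much more. With more than two classes a per-class median need not sum to one; whether the result is renormalised is not stated here.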

Examples

>>> from mlquantify.meta import EnsembleQ
>>> from mlquantify.matching import DyS
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=300, random_state=42)
>>> q = EnsembleQ(DyS(estimator=LogisticRegression()), size=10).fit(X, y)
>>> q.predict(X)
{0: ..., 1: ...}

ptr_selection_metric(prevalences, train_prevalences)[source]#

Select members whose training prevalence is closest to the test estimate.

Computes an initial test-prevalence estimate by averaging all member predictions, then retains the top p_metric fraction of members ranked by how closely their training prevalence matches that estimate.

Parameters:
prevalences : ndarray of shape (n_members, n_classes)

Prevalence estimates from each ensemble member.

train_prevalences : list of dict or ndarray

Training prevalences recorded for each ensemble member during fit.

Returns:
selected : list of ndarray

Prevalence estimates from the selected subset of members.
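
A small NumPy sketch of this rule follows. The function name, the use of mean absolute error as the closeness measure, and the rounding of p_metric into a member count are assumptions for illustration; the description above does not specify them.

```python
import numpy as np


def select_by_prevalence(member_prevs, train_prevs, p_metric=0.25):
    """Keep members whose training prevalence is nearest the initial estimate."""
    # Initial test estimate: average of all member predictions.
    initial = member_prevs.mean(axis=0)
    # Closeness of each training prevalence to that estimate
    # (mean absolute error is an assumed choice of distance).
    errors = np.abs(train_prevs - initial).mean(axis=1)
    n_keep = max(1, int(round(p_metric * len(train_prevs))))
    keep = np.argsort(errors)[:n_keep]  # closest members first
    return member_prevs[keep]


member_prevs = np.array([[0.30, 0.70], [0.35, 0.65], [0.40, 0.60], [0.35, 0.65]])
train_prevs = np.array([[0.1, 0.9], [0.3, 0.7], [0.5, 0.5], [0.9, 0.1]])

selected = select_by_prevalence(member_prevs, train_prevs, p_metric=0.5)
```

The initial estimate is [0.35, 0.65], so the members trained at [0.3, 0.7] and [0.5, 0.5] are kept and the two trained at extreme prevalences are dropped.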

save_quantifier(path: str | None = None) -> None[source]#

Save the quantifier instance to a file.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
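
The `<component>__<parameter>` convention is easiest to see on a scikit-learn Pipeline, which is used below instead of EnsembleQ because the nested parameter names exposed by EnsembleQ are not listed here. The step names "scale" and "clf" are arbitrary labels chosen for the example.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# "<step name>__<parameter>" reaches into the nested estimator.
pipe.set_params(clf__C=0.5, scale__with_mean=False)

# With deep=True, get_params reports nested parameters under the same keys.
params = pipe.get_params(deep=True)
```

Since set_params returns the estimator itself, calls can be chained, for example `pipe.set_params(clf__C=0.5).fit(X, y)`.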

sout(msg)[source]#

Print a message if verbose is True.