EnsembleQ

class mlquantify.meta.EnsembleQ(quantifier, size=50, min_prop=0.1, max_prop=1, selection_metric='all', protocol='uniform', p_metric=0.25, return_type='mean', max_sample_size=None, max_trials=100, n_jobs=1, verbose=False)

Ensemble-based quantifier with prevalence-controlled diversity.

Constructs an ensemble of base quantifiers, each trained on a subsample of the training data drawn according to a prevalence-sampling protocol. At prediction time the ensemble members’ estimates are aggregated, with optional selection of the most relevant subset.

Parameters:
quantifier : BaseQuantifier

The base quantifier used for each ensemble member.

size : int, default=50

Number of ensemble members to train.

min_prop : float, default=0.1

Minimum class prevalence proportion for sampling batches.

max_prop : float, default=1.0

Maximum class prevalence proportion for sampling batches.

selection_metric : {'all', 'ptr', 'ds'}, default='all'

Member selection strategy. 'all' uses every member; 'ptr' selects members whose training prevalence is closest to the initial test estimate; 'ds' selects members whose training score distribution is closest to the test distribution.

p_metric : float, default=0.25

Fraction of ensemble members to retain when applying a selection metric.

protocol : {'artificial', 'natural', 'uniform', 'kraemer'}, default='uniform'

Prevalence-sampling protocol for generating training batches.

return_type : {'mean', 'median'}, default='mean'

Aggregation function applied to the selected member estimates.

max_sample_size : int or None, default=None

Maximum samples per training batch; None uses the full dataset.

max_trials : int, default=100

Maximum sampling attempts per batch.

n_jobs : int, default=1

Number of parallel jobs for training.

verbose : bool, default=False

Print progress messages.

Attributes:
models : list

Fitted ensemble member quantifiers.

train_prevalences : list

Training prevalences for each ensemble member.

classes : ndarray of shape (n_classes,)

Class labels seen during fit.

References

[1]

Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using Ensembles for Problems with Characterizable Changes in Data Distribution: A Case Study on Quantification. Information Fusion, 34, 87–100.

[2]

Pérez-Gállego, P., Castaño, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic Ensemble Selection for Quantification Tasks. Information Fusion, 45, 1–15.

Examples

>>> from mlquantify.meta import EnsembleQ
>>> from mlquantify.matching import DyS
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=300, random_state=42)
>>> q = EnsembleQ(DyS(estimator=LogisticRegression()), size=10).fit(X, y)
>>> q.predict(X)
{0: 0.49, 1: 0.51}
ds_get_posteriors(X, y)

Compute cross-validated posterior probabilities for the DS selection metric.

Fits a logistic regression classifier with hyperparameters tuned by grid-search and returns out-of-fold posterior probabilities for the training data together with a callable for new instances.

Parameters:
X : array-like of shape (n_samples, n_features)

Training feature matrix.

y : array-like of shape (n_samples,)

Training class labels.

Returns:
posteriors : ndarray of shape (n_samples, n_classes)

Out-of-fold posterior probabilities for the training data.

posteriors_generator : callable

predict_proba method of the best-fitted estimator, used to generate posteriors for unseen test data during predict.

Notes

Cross-validated posteriors ensure that no training sample is scored by a model trained on it, preventing over-optimistic score distributions. A separate logistic regression is used regardless of the base quantifier so that soft scores are always available for the DS metric.
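
The idea in the note above can be sketched with scikit-learn alone. This is a minimal illustration under stated assumptions, not the method's actual code: the parameter grid, the number of folds, and the variable names are all chosen here for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict

X, y = make_classification(n_samples=300, random_state=42)

# Tune the regularization strength by grid search.
# The grid and cv=5 are assumptions made for this sketch.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=5,
)
search.fit(X, y)

# Out-of-fold posteriors: every row is scored by a model that never saw it.
posteriors = cross_val_predict(
    search.best_estimator_, X, y, cv=5, method="predict_proba"
)

# Callable for unseen data, taken from the estimator refit on all of (X, y).
posteriors_generator = search.best_estimator_.predict_proba
```

Here `posteriors` plays the role of the first return value and `posteriors_generator` the second; calling `posteriors_generator(X_test)` on new data yields an array of shape (n_samples, n_classes).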

ds_selection_metric(X, prevalences, train_distributions, posteriors_generator)

Select members whose training score distribution is closest to the test distribution.

Computes posterior-probability histograms for the test data and retains the top p_metric fraction of ensemble members ranked by Hellinger distance between their stored training histogram and the test histogram.

Parameters:
X : array-like of shape (n_samples, n_features)

Test feature matrix used to compute the test score distribution.

prevalences : ndarray of shape (n_members, n_classes)

Prevalence estimates from each ensemble member.

train_distributions : list of ndarray

Posterior-probability histograms stored for each member during fit.

posteriors_generator : callable

Function that returns posterior probabilities for new data (obtained from ds_get_posteriors during fit).

Returns:
selected : list of ndarray

Prevalence estimates from the selected subset of members.
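
The ranking step can be sketched with NumPy. This is a minimal illustration, not the library's implementation: the helper names `hellinger` and `select_by_distribution`, the precomputed histograms, and the rule for rounding the number of retained members are assumptions made for the example.

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def select_by_distribution(prevalences, train_distributions, test_hist, p_metric=0.25):
    """Keep the estimates of the members whose training histogram is closest
    to the test histogram (hypothetical helper, for illustration only)."""
    prevalences = np.asarray(prevalences, dtype=float)
    distances = np.array([hellinger(h, test_hist) for h in train_distributions])
    # Assumed rounding rule: truncate, but always keep at least one member.
    n_keep = max(1, int(len(prevalences) * p_metric))
    keep = np.argsort(distances)[:n_keep]
    return prevalences[keep]

# Normalized score histograms (4 bins) for the test set and four members.
test_hist = [0.1, 0.2, 0.3, 0.4]
train_distributions = [
    [0.4, 0.3, 0.2, 0.1],
    [0.25, 0.25, 0.25, 0.25],
    [0.1, 0.2, 0.3, 0.4],   # identical to the test histogram
    [0.7, 0.1, 0.1, 0.1],
]
prevalences = np.array([[0.8, 0.2], [0.5, 0.5], [0.35, 0.65], [0.9, 0.1]])

selected = select_by_distribution(prevalences, train_distributions, test_hist)
```

With four members and `p_metric=0.25`, a single member is retained: the third one, whose training histogram matches the test histogram exactly.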

fit(X, y)

Fit the ensemble by training one base quantifier per sampled batch.

Batches are drawn from (X, y) according to the chosen protocol so that each member is trained on a subset with a different class prevalence distribution, promoting diversity. When selection_metric is 'ds', posterior probabilities are precomputed for later use during prediction.

Parameters:
X : array-like of shape (n_samples, n_features)

Training feature matrix.

y : array-like of shape (n_samples,)

Training class labels.

Returns:
self : EnsembleQ

The fitted ensemble quantifier.

Raises:
ValueError

If selection_metric='ds' is used on a multiclass dataset.

Examples

>>> from mlquantify.meta import EnsembleQ
>>> from mlquantify.matching import DyS
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=300, random_state=42)
>>> q = EnsembleQ(DyS(estimator=LogisticRegression()), size=10).fit(X, y)
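
To make the idea of prevalence-controlled batches concrete, the sketch below draws binary batches at target prevalences using NumPy only. It is an illustration under assumptions, not the protocol code used by EnsembleQ: the helper `sample_batch_indices`, sampling with replacement, the batch size, and drawing targets uniformly between 0.1 and 1.0 are all choices made for this example.

```python
import numpy as np

def sample_batch_indices(y, prevalence, batch_size, rng):
    """Draw indices so that class 1 makes up about `prevalence` of the batch.

    Hypothetical helper for binary labels {0, 1}.
    """
    n_pos = int(round(prevalence * batch_size))
    n_neg = batch_size - n_pos
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Sample with replacement so a small class can still fill its quota.
    chosen = np.concatenate([
        rng.choice(pos_idx, size=n_pos, replace=True),
        rng.choice(neg_idx, size=n_neg, replace=True),
    ])
    rng.shuffle(chosen)
    return chosen

rng = np.random.default_rng(0)
y = np.array([0] * 150 + [1] * 150)

# One target prevalence per member (range chosen to mirror min_prop/max_prop).
targets = rng.uniform(0.1, 1.0, size=5)
batches = [sample_batch_indices(y, p, batch_size=100, rng=rng) for p in targets]

# Realized positive-class prevalence of each batch.
batch_prevalences = [y[idx].mean() for idx in batches]
```

Each member would then be fitted on `X[idx], y[idx]` for its own batch, which is what gives the ensemble members different training prevalences.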
get_metadata_routing()

Get metadata routing of this object.

See the User Guide for how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

predict(X)

Predict class prevalences by aggregating the ensemble members' estimates.

Each fitted member produces a prevalence estimate; if a selection metric ('ptr' or 'ds') was configured, only the most relevant members are retained before computing the final mean or median.

Parameters:
X : array-like of shape (n_samples, n_features)

Test feature matrix.

Returns:
prevalences : dict or ndarray of shape (n_classes,)

Estimated class prevalences, aggregated across the selected ensemble members.

Examples

>>> from mlquantify.meta import EnsembleQ
>>> from mlquantify.matching import DyS
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.datasets import make_classification
>>> X, y = make_classification(n_samples=300, random_state=42)
>>> q = EnsembleQ(DyS(estimator=LogisticRegression()), size=10).fit(X, y)
>>> q.predict(X)
{0: 0.49, 1: 0.51}
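
The effect of return_type can be illustrated with a small NumPy sketch. The member estimates below are made-up numbers for the example, and the code only mimics the final aggregation step; it is not taken from the library.

```python
import numpy as np

# Made-up prevalence estimates from four members (rows) for two classes.
member_estimates = np.array([
    [0.40, 0.60],
    [0.45, 0.55],
    [0.50, 0.50],
    [0.90, 0.10],  # an outlying member
])

# 'mean' aggregation: pulled toward the outlier.
mean_estimate = member_estimates.mean(axis=0)            # [0.5625, 0.4375]

# 'median' aggregation: more robust to the outlier.
median_estimate = np.median(member_estimates, axis=0)    # [0.475, 0.525]
```

In the binary case the per-class median still sums to 1; with more than two classes it need not, so a renormalization step may be required after taking the median.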
ptr_selection_metric(prevalences, train_prevalences)

Select members whose training prevalence is closest to the test estimate.

Computes an initial test-prevalence estimate by averaging all member predictions, then retains the top p_metric fraction of members ranked by how closely their training prevalence matches that estimate.

Parameters:
prevalences : ndarray of shape (n_members, n_classes)

Prevalence estimates from each ensemble member.

train_prevalences : list of dict or ndarray

Training prevalences recorded for each ensemble member during fit.

Returns:
selected : list of ndarray

Prevalence estimates from the selected subset of members.
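
The selection rule described above can be sketched as follows. This is a minimal illustration, not the library's code: the helper name `select_by_training_prevalence`, the use of mean squared error as the closeness measure, and the rounding of the number of retained members are assumptions.

```python
import numpy as np

def select_by_training_prevalence(prevalences, train_prevalences, p_metric=0.25):
    """Keep the estimates of the members trained at prevalences closest to an
    initial test estimate (hypothetical helper, for illustration only)."""
    prevalences = np.asarray(prevalences, dtype=float)
    train_prevalences = np.asarray(train_prevalences, dtype=float)
    # Initial test-prevalence estimate: average of all member predictions.
    initial_estimate = prevalences.mean(axis=0)
    # Assumed closeness measure: mean squared error per member.
    distances = np.mean((train_prevalences - initial_estimate) ** 2, axis=1)
    # Assumed rounding rule: truncate, but always keep at least one member.
    n_keep = max(1, int(len(prevalences) * p_metric))
    keep = np.argsort(distances)[:n_keep]
    return prevalences[keep]

prevalences = [[0.30, 0.70], [0.35, 0.65], [0.25, 0.75], [0.30, 0.70]]
train_prevalences = [[0.90, 0.10], [0.32, 0.68], [0.50, 0.50], [0.10, 0.90]]

selected = select_by_training_prevalence(prevalences, train_prevalences)
```

The initial estimate here is [0.3, 0.7]; the second member was trained at [0.32, 0.68], the closest training prevalence, so only its estimate is retained.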

save_quantifier(path: str | None = None) -> None

Save the quantifier instance to a file.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
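
The <component>__<parameter> convention is easiest to see on a scikit-learn Pipeline, as in the sketch below. The step names "scale" and "clf" are arbitrary choices for the example; whether a given EnsembleQ instance exposes its base quantifier's parameters in the same nested form is not verified here.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Update the C parameter of the nested "clf" step through the outer object.
pipe.set_params(clf__C=0.5)

# get_params(deep=True) reports nested parameters under the same key.
updated_c = pipe.get_params()["clf__C"]
```

Because set_params returns the estimator itself, calls can be chained, for example `pipe.set_params(clf__C=0.5).fit(X, y)`.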

sout(msg)

Prints a message if verbose is True.