EnsembleQ#

class mlquantify.meta.EnsembleQ(quantifier, size=50, min_prop=0.1, max_prop=1, selection_metric='all', protocol='uniform', p_metric=0.25, return_type='mean', max_sample_size=None, max_trials=100, n_jobs=1, verbose=False)[source]#

Ensemble-based Quantifier combining multiple models trained on varied data samples with controlled prevalence distributions to improve robustness and accuracy.

This quantifier constructs an ensemble of quantification models from batches of training data sampled according to an evaluation protocol (e.g. ‘artificial’, ‘natural’, ‘uniform’, ‘kraemer’) under the specified prevalence constraints. Diverse models are trained on these subsamples, and at prediction time their prevalence estimates are filtered by the chosen selection metric and combined by the chosen aggregation method (mean or median).

Parameters:
quantifier : BaseQuantifier

The quantifier model class to be used for ensemble members.

size : int, default=50

Number of ensemble members (sub-models) to train.

min_prop, max_prop : float, default=(0.1, 1.0)

Minimum and maximum class prevalence proportions for generating training batches.

selection_metric : {‘all’, ‘ptr’, ‘ds’}, default=’all’

Metric used to select or weight ensemble members during aggregation:

  • ‘all’: uses all models equally,

  • ‘ptr’: selects models with prevalences closest to initial test prevalence estimates,

  • ‘ds’: selects models with score distributions similar to test data.

p_metric : float, default=0.25

Proportion of ensemble members to select according to the selection metric.

protocol : {‘artificial’, ‘natural’, ‘uniform’, ‘kraemer’}, default=’uniform’

Sampling protocol used to generate training data for ensemble models.

return_type : {‘mean’, ‘median’}, default=’mean’

Aggregation method for ensemble predictions.

max_sample_size : int or None, optional

Maximum number of samples per training batch; defaults to dataset size if None.

max_trials : int, default=100

Maximum number of trials for sampling.

n_jobs : int, default=1

Number of parallel jobs for training ensemble members.

verbose : bool, default=False

Enable verbose output.

Attributes:
models : list

List of fitted quantifier ensemble members.

train_prevalences : list

List of training prevalences corresponding to ensemble members.

train_distributions : list

List of historical training posterior histograms (used when selection_metric=’ds’).

posteriors_generator : callable or None

Function to generate posterior probabilities for new samples.

Notes

  • Ensemble diversity is controlled by sampling prevalences from the specified protocol.

  • The ‘ds’ selection metric requires probabilistic quantifiers and computes distribution similarity.

  • Uses sklearn’s LogisticRegression and GridSearchCV internally for posterior computation within ‘ds’.

References

[1]

Pérez-Gállego, P., Castaño, A., Ramón Quevedo, J., & José del Coz, J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15. https://doi.org/10.1016/j.inffus.2018.01.001

[2]

Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100. https://doi.org/10.1016/j.inffus.2016.07.001

Examples

>>> from mlquantify.meta import EnsembleQ
>>> from mlquantify.mixture import DyS
>>> from sklearn.ensemble import RandomForestClassifier
>>>
>>> ensemble = EnsembleQ(
...     quantifier=DyS(RandomForestClassifier()), 
...     size=30, 
...     protocol='artificial', # APP protocol 
...     selection_metric='ptr'
... )
>>> ensemble.fit(X_train, y_train)
>>> prevalence_estimates = ensemble.predict(X_test)
ds_get_posteriors(X, y)[source]#

Generate posterior probabilities using cross-validated logistic regression. This method computes posterior probabilities for the training data via cross-validation, using a logistic regression classifier with hyperparameters optimized through grid search. It also returns a function to generate posterior probabilities for new data.

Parameters:
X : array-like of shape (n_samples, n_features)

The feature matrix representing the training data.

y : array-like of shape (n_samples,)

The target vector representing class labels for the training data.

Returns:
posteriors : ndarray of shape (n_samples, n_classes)

Posterior probabilities for the training data obtained through cross-validation.

posteriors_generator : callable

A function that computes posterior probabilities for new input data.

Notes

  • In scenarios where the quantifier is not based on a probabilistic classifier, it’s necessary to train a separate probabilistic model to obtain posterior probabilities.

  • Using cross-validation ensures that the posterior probabilities for the training data are unbiased, as each data point is evaluated by a model not trained on that point.

  • Hyperparameters for the logistic regression classifier are optimized using a grid search with cross-validation to improve the model’s performance.
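
A rough, self-contained sketch of this idea built from scikit-learn primitives; the hyperparameter grid and number of folds below are illustrative and not necessarily those used internally:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict

X, y = make_classification(n_samples=1000, random_state=0)

# Tune the regularisation strength of the logistic regression (illustrative grid).
grid = GridSearchCV(LogisticRegression(max_iter=1000),
                    param_grid={'C': [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
best = grid.best_estimator_

# Out-of-fold posteriors: each training point is scored by a model that
# never saw it, so the estimates are not optimistically biased.
posteriors = cross_val_predict(best, X, y, cv=5, method='predict_proba')

# Generator for new samples: refit the tuned model on all training data.
posteriors_generator = best.fit(X, y).predict_proba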

ds_selection_metric(X, prevalences, train_distributions, posteriors_generator)[source]#

Selects the prevalence estimates from models trained on samples whose distribution of posterior probabilities is most similar to the distribution of posterior probabilities for the test data.

Parameters:
X : array-like of shape (n_samples, n_features)

The feature matrix representing the test data.

prevalences : numpy.ndarray

An array of prevalence estimates provided by each model in the ensemble.

train_distributions : list

Histograms of training posterior probabilities, one per ensemble member.

posteriors_generator : callable

Function used to compute posterior probabilities for the test data.

Returns:
numpy.ndarray

The selected prevalence estimates after applying the DS selection metric.
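
A minimal sketch of the underlying idea for the binary case, using a Hellinger-style distance between posterior histograms; the actual distance measure and binning used by the implementation may differ:

import numpy as np

def posterior_histogram(posteriors, bins=10):
    # Normalised histogram of positive-class posterior probabilities.
    h, _ = np.histogram(posteriors, bins=bins, range=(0.0, 1.0))
    return h / max(h.sum(), 1)

def hellinger(a, b):
    return np.sqrt(0.5 * np.sum((np.sqrt(a) - np.sqrt(b)) ** 2))

def ds_select(prevalences, train_distributions, test_posteriors, k):
    # Keep the k members whose training histogram is closest to the
    # histogram of the test posteriors, and return their estimates.
    test_hist = posterior_histogram(test_posteriors[:, 1])
    distances = [hellinger(d, test_hist) for d in train_distributions]
    keep = np.argsort(distances)[:k]
    return np.asarray(prevalences)[keep]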

fit(X, y)[source]#

Fits the ensemble model to the given training data.

Parameters:
X : array-like of shape (n_samples, n_features)

The input data.

y : array-like of shape (n_samples,)

The target values.

Returns:
self : EnsembleQ

The fitted ensemble model.
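
Conceptually, fitting draws size training batches with prevalences sampled from the chosen protocol and trains one ensemble member per batch, recording each batch’s training prevalence for later selection. A simplified, self-contained sketch of that loop for the binary case with a uniform protocol; the sample_with_prevalence helper is hypothetical, and a plain classifier stands in for the wrapped quantifier:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sample_with_prevalence(X, y, prevalence, sample_size, rng):
    # Hypothetical helper: draw a batch whose positive-class proportion
    # is approximately `prevalence` (binary case, illustration only).
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    n_pos = int(np.clip(round(prevalence * sample_size), 1, sample_size - 1))
    idx = np.concatenate([rng.choice(pos, n_pos, replace=True),
                          rng.choice(neg, sample_size - n_pos, replace=True)])
    return X[idx], y[idx]

X, y = make_classification(n_samples=2000, random_state=0)
rng = np.random.default_rng(0)

members, train_prevalences = [], []
for p in rng.uniform(0.1, 1.0, size=10):        # min_prop / max_prop, size
    Xb, yb = sample_with_prevalence(X, y, p, sample_size=500, rng=rng)
    members.append(LogisticRegression(max_iter=1000).fit(Xb, yb))
    train_prevalences.append(yb.mean())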

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

predict(X)[source]#

Predicts the class prevalences for the given test data.

Parameters:
X : array-like of shape (n_samples, n_features)

The input data.

Returns:
prevalences : array-like of shape (n_classes,)

The predicted class prevalences.
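
After the selection metric has filtered the member estimates, the surviving rows are combined according to return_type. A small sketch of that final aggregation step, with made-up member estimates:

import numpy as np

# One row per selected ensemble member, columns are class prevalences.
member_estimates = np.array([[0.28, 0.72],
                             [0.35, 0.65],
                             [0.30, 0.70]])

prevalence_mean = member_estimates.mean(axis=0)           # return_type='mean'
prevalence_median = np.median(member_estimates, axis=0)   # return_type='median'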

ptr_selection_metric(prevalences, train_prevalences)[source]#

Selects the prevalence estimates from models trained on samples whose prevalence is most similar to an initial approximation of the test prevalence as estimated by all models in the ensemble.

Parameters:
prevalences : numpy.ndarray

An array of prevalence estimates provided by each model in the ensemble.

train_prevalences : numpy.ndarray

Training prevalences of the ensemble members, in the same order as prevalences.

Returns:
numpy.ndarray

The selected prevalence estimates after applying the PTR selection metric.
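
A minimal sketch of the PTR idea for the binary case, assuming train_prevalences holds each member’s positive-class training proportion: average all member estimates to form an initial guess of the test prevalence, then keep the fraction p_metric of members whose training prevalence lies closest to that guess:

import numpy as np

def ptr_select(prevalences, train_prevalences, p_metric=0.25):
    prevalences = np.asarray(prevalences)               # shape (size, n_classes)
    train_prevalences = np.asarray(train_prevalences)   # positive-class proportions
    initial_guess = prevalences.mean(axis=0)[1]         # crude estimate of the test prevalence
    distances = np.abs(train_prevalences - initial_guess)
    k = max(1, int(round(p_metric * len(prevalences))))
    keep = np.argsort(distances)[:k]
    return prevalences[keep]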

save_quantifier(path: str | None = None) → None[source]#

Save the quantifier instance to a file.

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.
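
For example, documented top-level parameters can be set directly, while parameters of the wrapped quantifier would use the nested form; the nested key below is purely illustrative, since the exact component names depend on the wrapped quantifier:

>>> ensemble.set_params(size=100, return_type='median')
>>> # Nested form (illustrative key):
>>> # ensemble.set_params(quantifier__learner__n_estimators=200)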

sout(msg)[source]#

Prints a message if verbose is True.