EnsembleQ

class mlquantify.meta.EnsembleQ(quantifier, size=50, min_prop=0.1, max_prop=1, selection_metric='all', protocol='uniform', p_metric=0.25, return_type='mean', max_sample_size=None, max_trials=100, n_jobs=1, verbose=False)

Ensemble-based Quantifier combining multiple models trained on varied data samples with controlled prevalence distributions to improve robustness and accuracy.

This quantifier constructs an ensemble of quantification models using batches of training data sampled according to an evaluation protocol (e.g. ‘artificial’, ‘natural’, ‘uniform’, ‘kraemer’) with specified prevalence constraints. Diverse models are trained on these subsamples, and their prevalence estimates are then filtered by a selection metric and combined by an aggregation method.

Parameters:

quantifier (BaseQuantifier): The base quantifier to be used for ensemble members.
size (int, default=50): Number of ensemble members (sub-models) to train.
min_prop, max_prop (float, default=0.1 and 1.0): Minimum and maximum class prevalence proportions for generating training batches.
selection_metric ({‘all’, ‘ptr’, ‘ds’}, default=‘all’): Metric used to select or weight ensemble members during aggregation:
- ‘all’: uses all models equally;
- ‘ptr’: selects models with training prevalences closest to initial test prevalence estimates;
- ‘ds’: selects models with score distributions similar to the test data.
p_metric (float, default=0.25): Proportion of ensemble members to select according to the selection metric.
protocol ({‘artificial’, ‘natural’, ‘uniform’, ‘kraemer’}, default=‘uniform’): Sampling protocol used to generate training data for ensemble models.
return_type ({‘mean’, ‘median’}, default=‘mean’): Aggregation method for ensemble predictions.
max_sample_size (int or None, default=None): Maximum number of samples per training batch; defaults to dataset size if None.
max_trials (int, default=100): Maximum number of trials for sampling.
n_jobs (int, default=1): Number of parallel jobs for training ensemble members.
verbose (bool, default=False): Enable verbose output.

Attributes:

models (list): List of fitted quantifier ensemble members.
train_prevalences (list): List of training prevalences corresponding to ensemble members.
train_distributions (list): List of historical training posterior histograms (used when selection_metric=‘ds’).
posteriors_generator (callable or None): Function to generate posterior probabilities for new samples.

Notes

Ensemble diversity is controlled by sampling prevalences from the specified protocol.
The ‘ds’ selection metric requires probabilistic quantifiers and computes distribution similarity.
Uses sklearn’s LogisticRegression and GridSearchCV internally for posterior computation within ‘ds’.
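To make the first note concrete, here is a minimal sketch of drawing training batches at prevalences sampled between `min_prop` and `max_prop`. It assumes a binary problem with labels 0 and 1 and sampling with replacement; the helper `sample_batch_indices` is hypothetical and is not part of mlquantify, whose protocols may work differently.

```python
import numpy as np

def sample_batch_indices(y, prevalence, sample_size, rng):
    """Hypothetical helper: indices of a batch with ~`prevalence` positives."""
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    n_pos = int(round(prevalence * sample_size))
    n_neg = sample_size - n_pos
    # Assumption: sample with replacement so any prevalence is reachable.
    chosen_pos = rng.choice(pos_idx, size=n_pos, replace=True)
    chosen_neg = rng.choice(neg_idx, size=n_neg, replace=True)
    return np.concatenate([chosen_pos, chosen_neg])

rng = np.random.default_rng(0)
y = np.array([0] * 70 + [1] * 30)   # toy labels, 30% positives
min_prop, max_prop = 0.1, 1.0       # defaults listed in the signature

# One target prevalence per ensemble member (5 members here).
target_prevalences = rng.uniform(min_prop, max_prop, size=5)
batches = [sample_batch_indices(y, p, 40, rng) for p in target_prevalences]
batch_prevalences = [float(y[idx].mean()) for idx in batches]
```

Each batch would then be used to fit one ensemble member, so members see different class balances even though they share the same source data.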

References

[1] Pérez-Gállego, P., Castaño, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15. https://doi.org/10.1016/j.inffus.2018.01.001

[2] Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100. https://doi.org/10.1016/j.inffus.2016.07.001

Examples

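>>> from mlquantify.meta import EnsembleQ
>>> from mlquantify.mixture import DyS
>>> from sklearn.datasets import make_classification
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.model_selection import train_test_split
>>>
>>> # Synthetic binary data, for illustration only
>>> X, y = make_classification(n_samples=1000, random_state=42)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
>>>
>>> ensemble = EnsembleQ(
...     quantifier=DyS(RandomForestClassifier()),
...     size=30,
...     protocol='artificial',  # APP protocol
...     selection_metric='ptr'
... )
>>> ensemble.fit(X_train, y_train)
>>> prevalence_estimates = ensemble.predict(X_test)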

ds_get_posteriors(X, y)

Generate posterior probabilities using cross-validated logistic regression.

This method computes posterior probabilities for the training data via cross-validation, using a logistic regression classifier with hyperparameters optimized through grid search. It also returns a function to generate posterior probabilities for new data.

Parameters:

X (array-like of shape (n_samples, n_features)): The feature matrix representing the training data.
y (array-like of shape (n_samples,)): The target vector representing class labels for the training data.

Returns:

posteriors (ndarray of shape (n_samples, n_classes)): Posterior probabilities for the training data obtained through cross-validation.
posteriors_generator (callable): A function that computes posterior probabilities for new input data.

Notes

In scenarios where the quantifier is not based on a probabilistic classifier, it’s necessary to train a separate probabilistic model to obtain posterior probabilities.
Using cross-validation ensures that the posterior probabilities for the training data are unbiased, as each data point is evaluated by a model not trained on that point.
Hyperparameters for the logistic regression classifier are optimized using a grid search with cross-validation to improve the model’s performance.
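The notes above can be sketched with scikit-learn directly. This is an illustration of the general technique (out-of-fold posteriors from a grid-searched logistic regression), not the library's actual code; the parameter grid and fold counts are assumptions, since this page does not state them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Assumed grid and inner CV; the values used by mlquantify are not given here.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=3,
)

# Out-of-fold posteriors: each row is scored by a model that never saw it.
posteriors = cross_val_predict(search, X, y, cv=5, method="predict_proba")

# Refit on all training data and expose a callable for new samples.
search.fit(X, y)
posteriors_generator = search.predict_proba
```

Here `posteriors` plays the role of the first return value and `posteriors_generator` the second: calling `posteriors_generator(X_new)` yields an array of shape (n_samples, n_classes).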

ds_selection_metric(X, prevalences, train_distributions, posteriors_generator)

Selects the prevalence estimates from models trained on samples whose distribution of posterior probabilities is most similar to the distribution of posterior probabilities for the test data.

Parameters:

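X (array-like of shape (n_samples, n_features)): The feature matrix representing the test data.
prevalences (numpy.ndarray): An array of prevalence estimates provided by each model in the ensemble.
train_distributions (list): Training posterior histograms, one per ensemble member.
posteriors_generator (callable): Function that computes posterior probabilities for new samples.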

Returns:

numpy.ndarray: The selected prevalence estimates after applying the DS selection metric.
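As a rough sketch of distribution-similarity selection, the example below compares histograms of positive-class posteriors and keeps the estimates of the closest members. The Hellinger distance, the bin count, and all function names are assumptions made for illustration; this page does not state which distance or binning the library uses.

```python
import numpy as np

def posterior_histogram(pos_scores, bins=8):
    """Normalized histogram of positive-class posteriors on [0, 1]."""
    counts, _ = np.histogram(pos_scores, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (assumed metric)."""
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def select_by_distribution(prevalences, train_distributions,
                           test_distribution, p_metric=0.25):
    """Keep estimates of the members whose training histogram is closest."""
    distances = np.array(
        [hellinger(d, test_distribution) for d in train_distributions]
    )
    n_keep = max(1, int(len(prevalences) * p_metric))
    keep = np.argsort(distances)[:n_keep]
    return np.asarray(prevalences)[keep]

# Toy posteriors for four members' training batches.
train_scores = [
    np.linspace(0.0, 0.4, 50),
    np.linspace(0.2, 0.8, 50),
    np.linspace(0.6, 1.0, 50),
    np.linspace(0.0, 1.0, 50),
]
train_distributions = [posterior_histogram(s) for s in train_scores]
prevalences = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8], [0.6, 0.4]])

# Test posteriors that match the third member's training batch.
test_distribution = posterior_histogram(np.linspace(0.6, 1.0, 50))
selected = select_by_distribution(prevalences, train_distributions,
                                  test_distribution)
```

With `p_metric=0.25` and four members, a single member is kept: the one whose training histogram coincides with the test histogram.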

fit(X, y)

Fits the ensemble model to the given training data.

Parameters:

X (array-like of shape (n_samples, n_features)): The input data.
y (array-like of shape (n_samples,)): The target values.

Returns:

self (EnsembleQ): The fitted ensemble model.

get_metadata_routing()

Get metadata routing of this object.

See the scikit-learn User Guide on metadata routing for how the routing mechanism works.

Returns:

routing (MetadataRequest): A MetadataRequest encapsulating routing information.

get_params(deep=True)

Get parameters for this estimator.

Parameters:

deep (bool, default=True): If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:

params (dict): Parameter names mapped to their values.

predict(X)

Predicts the class prevalences for the given test data.

Parameters:

X (array-like of shape (n_samples, n_features)): The input data.

Returns:

prevalences (array-like of shape (n_samples, n_classes)): The predicted class prevalences.
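As a rough picture of how the `return_type` options (‘mean’ or ‘median’) could combine the members' estimates into one prediction, consider the sketch below. The `aggregate` helper is hypothetical and only illustrates the two aggregation rules; it is not the library's code.

```python
import numpy as np

def aggregate(estimates, return_type="mean"):
    """Hypothetical helper: combine per-member prevalence estimates."""
    estimates = np.asarray(estimates, dtype=float)
    if return_type == "mean":
        return estimates.mean(axis=0)
    if return_type == "median":
        return np.median(estimates, axis=0)
    raise ValueError(f"unknown return_type: {return_type!r}")

# One row per ensemble member, one column per class.
member_estimates = np.array([
    [0.70, 0.30],
    [0.60, 0.40],
    [0.20, 0.80],
])

mean_estimate = aggregate(member_estimates, "mean")
median_estimate = aggregate(member_estimates, "median")
```

The median is less sensitive to a single outlying member (the third row here), but a per-class median is not guaranteed to sum to 1 in general; whether the library renormalizes is not stated on this page.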

ptr_selection_metric(prevalences, train_prevalences)

Selects the prevalence estimates from models trained on samples whose prevalence is most similar to an initial approximation of the test prevalence as estimated by all models in the ensemble.

Parameters:

prevalences (numpy.ndarray): An array of prevalence estimates provided by each model in the ensemble.
train_prevalences (list): Training prevalences corresponding to the ensemble members.

Returns:

numpy.ndarray: The selected prevalence estimates after applying the PTR selection metric.
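The selection described above can be sketched in a few lines of NumPy. This is an illustration under assumptions: the initial approximation is taken as the mean of all members' estimates, closeness is measured with an L1 distance, and `select_by_prevalence` is a hypothetical name rather than the library's implementation.

```python
import numpy as np

def select_by_prevalence(prevalences, train_prevalences, p_metric=0.25):
    """Keep estimates from members trained near the initial test estimate."""
    prevalences = np.asarray(prevalences, dtype=float)
    train_prevalences = np.asarray(train_prevalences, dtype=float)
    # Initial approximation of the test prevalence from all members.
    initial_estimate = prevalences.mean(axis=0)
    # Assumed distance: L1 between training and estimated test prevalence.
    distances = np.abs(train_prevalences - initial_estimate).sum(axis=1)
    n_keep = max(1, int(len(prevalences) * p_metric))
    keep = np.argsort(distances)[:n_keep]
    return prevalences[keep]

# Four members: their estimates on the test set and their training prevalences.
prevalences = np.array([[0.8, 0.2], [0.6, 0.4], [0.5, 0.5], [0.3, 0.7]])
train_prevalences = np.array([[0.9, 0.1], [0.6, 0.4], [0.5, 0.5], [0.2, 0.8]])

selected = select_by_prevalence(prevalences, train_prevalences, p_metric=0.5)
final_estimate = selected.mean(axis=0)
```

In this toy case the initial estimate is (0.55, 0.45), so the two members trained at (0.6, 0.4) and (0.5, 0.5) are kept and the two trained at extreme prevalences are discarded.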

save_quantifier(path: str | None = None) → None: Save the quantifier instance to a file.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:

**params (dict): Estimator parameters.

Returns:

self (estimator instance): Estimator instance.
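The `<component>__<parameter>` convention is the standard scikit-learn one, so it can be illustrated with a plain scikit-learn `Pipeline`. The step names `scale` and `clf` are arbitrary choices for this sketch; the parameter names exposed by an `EnsembleQ` instance are not listed on this page and can be inspected with `get_params()`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# A nested object: each step is itself an estimator with parameters.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Update parameters of nested components in one call.
pipe.set_params(clf__C=0.5, scale__with_mean=False)

# deep=True also returns the parameters of the contained estimators.
params = pipe.get_params(deep=True)
```

After the call, `params["clf__C"]` reflects the new value, and the same change is visible on the step itself via `pipe.named_steps["clf"].C`.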

sout(msg): Prints a message if verbose is True.
