Ensemble#

class mlquantify.methods.meta.Ensemble(quantifier: Quantifier, size: int = 50, min_prop: float = 0.1, selection_metric: str = 'all', p_metric: float = 0.25, return_type: str = 'mean', max_sample_size: int | None = None, max_trials: int = 100, n_jobs: int = 1, verbose: bool = False)[source]#

Ensemble of Quantification Models.

This class implements an ensemble of quantification methods with support for parallel processing. The approach follows the articles by Pérez-Gállego et al. (2017, 2019).

The ensemble draws multiple training samples, each with a different class proportion, and fits one base quantifier per sample. At prediction time it keeps the k models with the lowest (best) score on the chosen selection metric and aggregates their prevalence estimates.
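For concreteness, the sketch below walks through that procedure. It is an illustration rather than the library's internal code: it assumes the base quantifier can be cloned via sklearn-style get_params, that its predict returns a class-to-prevalence dict (as in the Examples below), and it hard-codes a PTR-style selection rule in place of the configurable selection metric.

import numpy as np
from sklearn.base import clone

def ensemble_sketch(quantifier, X, y, X_test, size=50, min_prop=0.1,
                    p_metric=0.25, return_type='mean', seed=0):
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    n = len(y)
    models, train_prevs = [], []
    while len(models) < size:
        # Draw a random class-prevalence vector; retry until every entry
        # reaches min_prop (the real class caps retries via max_trials).
        prev = rng.dirichlet(np.ones(len(classes)))
        if prev.min() < min_prop:
            continue
        # Resample the training set (with replacement) to match `prev`.
        idx = np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=max(1, int(round(p * n))),
                       replace=True)
            for c, p in zip(classes, prev)])
        models.append(clone(quantifier).fit(X[idx], y[idx]))
        train_prevs.append(prev)
    # Each member estimates the test prevalence; predict() is assumed to
    # return a dict mapping class label to proportion.
    preds = np.array([[m.predict(X_test)[c] for c in classes] for m in models])
    # PTR-style selection: keep the fraction p_metric of models whose
    # training prevalence is closest to the mean of all estimates.
    dists = np.abs(np.asarray(train_prevs) - preds.mean(axis=0)).mean(axis=1)
    k = max(1, int(size * p_metric))
    selected = preds[np.argsort(dists)[:k]]
    agg = np.mean if return_type == 'mean' else np.median
    return dict(zip(classes, agg(selected, axis=0)))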

Parameters:
quantifier : Quantifier

The base quantifier model to be used in the ensemble.

size : int, optional (default=50)

The number of samples to be generated for the ensemble.

min_prop : float, optional (default=0.1)

The minimum proportion of each class in the generated samples.

selection_metric : str, optional (default='all')

The metric used for selecting the best models in the ensemble. Valid options are 'all', 'ptr', and 'ds'.

p_metric : float, optional (default=0.25)

The proportion of models to be selected based on the selection metric.

return_type : str, optional (default='mean')

The type of aggregation to be used for the final prediction. Valid options are 'mean' and 'median'.

max_sample_size : int or None, optional (default=None)

The maximum size of the samples to be generated. If None, the entire dataset is used.

max_trials : int, optional (default=100)

The maximum number of trials to generate valid samples.

n_jobs : int, optional (default=1)

The number of parallel jobs to run.

verbose : bool, optional (default=False)

If True, prints progress messages during fitting and prediction.

Attributes:
base_quantifier : Quantifier

The base quantifier model used in the ensemble.

size : int

The number of samples generated for the ensemble.

min_prop : float

The minimum proportion of each class in the generated samples.

selection_metric : str

The metric used for selecting the best models in the ensemble. Valid options are:

  • 'all' -> keep the predictions of every model.

  • 'ptr' -> select the models whose training-sample prevalence is closest to an initial estimate of the test prevalence, according to the selected error measure.

  • 'ds' -> select the models by the Hellinger distance between the posterior distributions of their training samples and of the test data.

p_metric : float

The proportion of models to be selected based on the selection metric.

return_type : str

The type of aggregation used for the final prediction. Valid options are 'mean' and 'median'.

max_sample_size : int or None

The maximum size of the samples to be generated. If None, the entire dataset is used.

max_trials : int

The maximum number of trials to generate valid samples.

n_jobs : int

The number of parallel jobs to run.

verbose : bool

If True, prints progress messages during fitting and prediction.

See also

joblib.Parallel

Parallel processing utility for Python.

References

[1]

PÉREZ-GÁLLEGO, Pablo; QUEVEDO, José Ramón; DEL COZ, Juan José. Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, v. 34, p. 87-100, 2017. Available at https://www.sciencedirect.com/science/article/abs/pii/S1566253516300628

[2]

PÉREZ-GÁLLEGO, Pablo et al. Dynamic ensemble selection for quantification tasks. Information Fusion, v. 45, p. 1-15, 2019. Available at https://www.sciencedirect.com/science/article/abs/pii/S1566253517303652

Examples

>>> from mlquantify.methods import FM, Ensemble
>>> from mlquantify.utils.general import get_real_prev
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.model_selection import train_test_split
>>> 
>>> features, target = load_breast_cancer(return_X_y=True)
>>> 
>>> X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3)
>>> 
>>> model = FM(RandomForestClassifier())
>>> ensemble = Ensemble(quantifier=model,
...                     size=50,
...                     selection_metric='ptr',
...                     return_type='median',
...                     n_jobs=-1,
...                     verbose=False)
>>> 
>>> ensemble.fit(X_train, y_train)
>>> 
>>> predictions = ensemble.predict(X_test)
>>> predictions
{0: 0.4589857954621449, 1: 0.5410142045378551}
>>> get_real_prev(y_test)
{0: 0.45614035087719296, 1: 0.543859649122807}
ds_get_posteriors(X, y)[source]#

Generate posterior probabilities using cross-validated logistic regression. This method computes posterior probabilities for the training data via cross-validation, using a logistic regression classifier with hyperparameters optimized through grid search. It also returns a function to generate posterior probabilities for new data.

Parameters:
X : array-like of shape (n_samples, n_features)

The feature matrix representing the training data.

y : array-like of shape (n_samples,)

The target vector representing class labels for the training data.

Returns:
posteriors : ndarray of shape (n_samples, n_classes)

Posterior probabilities for the training data obtained through cross-validation.

posteriors_generator : callable

A function that computes posterior probabilities for new input data.

Notes

  • If the quantifier is not based on a probabilistic classifier, a separate probabilistic model must be trained to obtain posterior probabilities.

  • Cross-validation ensures that the posterior probabilities for the training data are unbiased, as each data point is evaluated by a model not trained on that point.

  • Hyperparameters of the logistic regression classifier are optimized by grid search with cross-validation to improve the model's performance.
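A minimal sketch of this scheme with scikit-learn is shown below; the grid of C values and the number of folds are illustrative assumptions, not the settings used by the library.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict

def get_posteriors_sketch(X, y, cv=5):
    # Tune the regularization strength with a small grid search
    # (grid and fold count are illustrative choices).
    grid = GridSearchCV(LogisticRegression(max_iter=1000),
                        param_grid={'C': np.logspace(-3, 3, 7)}, cv=cv)
    grid.fit(X, y)
    best = grid.best_estimator_  # refit on all of X by default
    # Out-of-fold posteriors: each row is scored by a model that never
    # saw that row during training, so the estimates are unbiased.
    posteriors = cross_val_predict(best, X, y, cv=cv, method='predict_proba')
    # Generator for new data: the tuned model refit on the full training set.
    posteriors_generator = best.predict_proba
    return posteriors, posteriors_generator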

ds_selection_metric(prevalences, test)[source]#

Selects the prevalence estimates from models trained on samples whose distribution of posterior probabilities is most similar to the distribution of posterior probabilities for the test data.

Parameters:
prevalences : numpy.ndarray

An array of prevalence estimates provided by each model in the ensemble.

test : array-like of shape (n_samples, n_features)

The feature matrix representing the test data.

Returns:
numpy.ndarray

The selected prevalence estimates after applying the DS selection metric.
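A simplified sketch of this selection rule is given below. It assumes each model's training-sample posteriors have already been binned into normalized histograms (train_hists), and it compares only the positive-class posterior column with a fixed 10-bin scheme; both are illustrative assumptions.

import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete distributions p and q.
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def ds_select_sketch(prevalences, train_hists, test_posteriors,
                     p_metric=0.25, bins=10):
    prevalences = np.asarray(prevalences)
    # Bin the positive-class posteriors of the test data into a
    # normalized histogram.
    counts, _ = np.histogram(test_posteriors[:, 1], bins=bins, range=(0.0, 1.0))
    test_hist = counts / counts.sum()
    # Distance of each model's training-sample histogram to the test one.
    dists = np.array([hellinger(h, test_hist) for h in train_hists])
    # Keep the estimates of the closest fraction p_metric of models.
    k = max(1, int(len(prevalences) * p_metric))
    return prevalences[np.argsort(dists)[:k]]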

fit(X, y)[source]#

Fits the ensemble model to the given training data.

Parameters:
X : array-like of shape (n_samples, n_features)

The input data.

y : array-like of shape (n_samples,)

The target values.

Returns:
self : Ensemble

The fitted ensemble model.

get_metadata_routing()[source]#

Get metadata routing of this object.

Please check the User Guide on how the routing mechanism works.

Returns:
routing : MetadataRequest

A MetadataRequest encapsulating routing information.

get_params(deep=True)[source]#

Get parameters for this estimator.

Parameters:
deep : bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns:
params : dict

Parameter names mapped to their values.

predict(X)[source]#

Predicts the class prevalences for the given test data.

Parameters:
X : array-like of shape (n_samples, n_features)

The input data.

Returns:
prevalences : dict

The predicted class prevalences, mapping each class label to its estimated proportion (see the Examples above).

ptr_selection_metric(prevalences)[source]#

Selects the prevalence estimates from models trained on samples whose prevalence is most similar to an initial approximation of the test prevalence as estimated by all models in the ensemble.

Parameters:
prevalences : numpy.ndarray

An array of prevalence estimates provided by each model in the ensemble.

Returns:
numpy.ndarray

The selected prevalence estimates after applying the PTR selection metric.
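A minimal sketch of this rule is shown below; it assumes the prevalence of each model's training sample is available as train_prevalences, and the mean absolute difference is an illustrative choice of error measure.

import numpy as np

def ptr_select_sketch(prevalences, train_prevalences, p_metric=0.25):
    prevalences = np.asarray(prevalences)
    # Initial approximation of the test prevalence: the mean of the
    # estimates produced by all ensemble members.
    approx = prevalences.mean(axis=0)
    # Keep the fraction p_metric of models whose training-sample
    # prevalence is closest to that approximation.
    dists = np.abs(np.asarray(train_prevalences) - approx).mean(axis=1)
    k = max(1, int(len(prevalences) * p_metric))
    return prevalences[np.argsort(dists)[:k]]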

set_params(**params)[source]#

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters:
**params : dict

Estimator parameters.

Returns:
self : estimator instance

Estimator instance.

sout(msg)[source]#

Prints a message if verbose is True.