Ensemble
- class mlquantify.methods.meta.Ensemble(quantifier: Quantifier, size: int = 50, min_prop: float = 0.1, selection_metric: str = 'all', p_metric: float = 0.25, return_type: str = 'mean', max_sample_size: int | None = None, max_trials: int = 100, n_jobs: int = 1, verbose: bool = False)
Ensemble of Quantification Models.
This class implements an ensemble of quantification methods, allowing parallel processing for evaluation. The ensemble method is based on the articles by Pérez-Gállego et al. (2017, 2019).
The ensemble is built by drawing multiple training samples, each with a different class proportion, and fitting one model per sample. For prediction, it keeps the k models with the lowest value of the selection metric.
- Parameters:
- quantifier : Quantifier
The base quantifier model to be used in the ensemble.
- size : int, optional (default=50)
The number of samples to be generated for the ensemble.
- min_prop : float, optional (default=0.1)
The minimum proportion of each class in the generated samples.
- selection_metric : str, optional (default='all')
The metric used for selecting the best models in the ensemble. Valid options are 'all', 'ptr', and 'ds'.
- p_metric : float, optional (default=0.25)
The proportion of models to be selected based on the selection metric.
- return_type : str, optional (default='mean')
The type of aggregation to be used for the final prediction. Valid options are 'mean' and 'median'.
- max_sample_size : int or None, optional (default=None)
The maximum size of the samples to be generated. If None, the entire dataset is used.
- max_trials : int, optional (default=100)
The maximum number of trials to generate valid samples.
- n_jobs : int, optional (default=1)
The number of parallel jobs to run.
- verbose : bool, optional (default=False)
If True, prints progress messages during fitting and prediction.
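How size, min_prop, max_sample_size, and max_trials might interact when a single training sample is drawn can be sketched as follows. This is a minimal, binary-only illustration using the standard library; the helper name draw_sample_indices and the uniform prevalence draw are assumptions made for the example, not the library's actual sampling code.

```python
import random


def draw_sample_indices(labels, sample_size, min_prop=0.1, max_trials=100, rng=None):
    """Draw indices for one binary training sample with a random class prevalence.

    Hypothetical helper for illustration only; not part of mlquantify.
    """
    rng = rng or random.Random()
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    for _ in range(max_trials):
        # Assumption: the positive-class prevalence is drawn uniformly
        # from [min_prop, 1 - min_prop].
        prevalence = rng.uniform(min_prop, 1.0 - min_prop)
        n_pos = round(prevalence * sample_size)
        n_neg = sample_size - n_pos
        # Accept the draw only if both classes reach the minimum proportion.
        if min(n_pos, n_neg) >= min_prop * sample_size:
            # Sample with replacement so a small class can be upsampled.
            chosen = rng.choices(positives, k=n_pos) + rng.choices(negatives, k=n_neg)
            return chosen, n_pos / sample_size
    raise RuntimeError("no valid sample found within max_trials")


labels = [0] * 60 + [1] * 40
indices, prevalence = draw_sample_indices(
    labels, sample_size=50, min_prop=0.1, rng=random.Random(0)
)
```

Repeating this draw size times would give one training sample, and therefore one fitted base quantifier, per ensemble member.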
- Attributes:
- base_quantifier : Quantifier
The base quantifier model to be used in the ensemble.
- size : int
The number of samples to be generated for the ensemble.
- min_prop : float
The minimum proportion of each class in the generated samples.
- selection_metric : str
The metric used for selecting the best models in the ensemble. Valid options are 'all', 'ptr', and 'ds':
  - 'all': returns all the predictions.
  - 'ptr': ranks the models by the selected error measure between their training prevalence and an initial estimate of the test prevalence.
  - 'ds': computes the Hellinger distance between the train and test distributions for each model.
- p_metric : float
The proportion of models to be selected based on the selection metric.
- return_type : str
The type of aggregation to be used for the final prediction. Valid options are 'mean' and 'median'.
- max_sample_size : int or None
The maximum size of the samples to be generated. If None, the entire dataset is used.
- max_trials : int
The maximum number of trials to generate valid samples.
- n_jobs : int
The number of parallel jobs to run.
- verbose : bool
If True, prints progress messages during fitting and prediction.
See also
joblib.Parallel
Parallel processing utility for Python.
References
[1] PÉREZ-GÁLLEGO, Pablo; QUEVEDO, José Ramón; DEL COZ, Juan José. Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, v. 34, p. 87-100, 2017. Available at https://www.sciencedirect.com/science/article/abs/pii/S1566253516300628?casa_token=XblH-3kwhf4AAAAA:oxNRiCdHZQQa1C8BCJM5PBnFrd26p8-9SSBdm8Luf1Dm35w88w0NdpvoCf1RxBBqtshjyAhNpsDd
[2] PÉREZ-GÁLLEGO, Pablo et al. Dynamic ensemble selection for quantification tasks. Information Fusion, v. 45, p. 1-15, 2019. Available at https://www.sciencedirect.com/science/article/abs/pii/S1566253517303652?casa_token=jWmc592j5uMAAAAA:2YNeZGAGD0NJEMkcO-YBr7Ak-Ik7njLEcG8SKdowLdpbJ0mwPjYKKiqvQ-C3qICG8yU0m4xUZ3Yv
Examples
>>> from mlquantify.methods import FM, Ensemble
>>> from mlquantify.utils.general import get_real_prev
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.datasets import load_breast_cancer
>>> from sklearn.model_selection import train_test_split
>>>
>>> features, target = load_breast_cancer(return_X_y=True)
>>>
>>> X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3)
>>>
>>> model = FM(RandomForestClassifier())
>>> ensemble = Ensemble(quantifier=model,
...                     size=50,
...                     selection_metric='ptr',
...                     return_type='median',
...                     n_jobs=-1,
...                     verbose=False)
>>>
>>> ensemble.fit(X_train, y_train)
>>>
>>> predictions = ensemble.predict(X_test)
>>> predictions
{0: 0.4589857954621449, 1: 0.5410142045378551}
>>> get_real_prev(y_test)
{0: 0.45614035087719296, 1: 0.543859649122807}
- ds_get_posteriors(X, y)
Generate posterior probabilities using cross-validated logistic regression.
This method computes posterior probabilities for the training data via cross-validation, using a logistic regression classifier with hyperparameters optimized through grid search. It also returns a function to generate posterior probabilities for new data.
- Parameters:
- X : array-like of shape (n_samples, n_features)
The feature matrix representing the training data.
- y : array-like of shape (n_samples,)
The target vector representing class labels for the training data.
- Returns:
- posteriors : ndarray of shape (n_samples, n_classes)
Posterior probabilities for the training data obtained through cross-validation.
- posteriors_generator : callable
A function that computes posterior probabilities for new input data.
Notes
- In scenarios where the quantifier is not based on a probabilistic classifier, it’s necessary to train a separate probabilistic model to obtain posterior probabilities.
- Using cross-validation ensures that the posterior probabilities for the training data are unbiased, as each data point is evaluated by a model not trained on that point.
- Hyperparameters for the logistic regression classifier are optimized using a grid search with cross-validation to improve the model’s performance.
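The combination described in these notes (out-of-fold posteriors from a grid-searched logistic regression, plus a callable for new data) can be approximated with standard scikit-learn tools. The sketch below is an illustration under stated assumptions, not the method's actual implementation: the synthetic dataset, the C grid, and the fold counts are all chosen for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict

# Synthetic binary data standing in for the training set (assumption).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Grid search over the regularisation strength C; the grid is illustrative.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=3,
)

# Out-of-fold posteriors: each row is scored by a model that never saw it.
posteriors = cross_val_predict(search, X, y, cv=5, method="predict_proba")

# Refit on all training data so new inputs can be scored later.
search.fit(X, y)
posteriors_generator = search.predict_proba
```

Calling posteriors_generator(X_new) then returns an array of shape (n_samples, n_classes) for unseen data.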
- ds_selection_metric(prevalences, test)
Selects the prevalence estimates from models trained on samples whose distribution of posterior probabilities is most similar to the distribution of posterior probabilities for the test data.
- Parameters:
- prevalences : numpy.ndarray
An array of prevalence estimates provided by each model in the ensemble.
- test : array-like of shape (n_samples, n_features)
The feature matrix representing the test data.
- Returns:
- numpy.ndarray
The selected prevalence estimates after applying the DS selection metric.
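A rough idea of distribution-similarity selection, assuming the posteriors are summarised as histograms and compared with the Hellinger distance. The helper names (histogram, hellinger, ds_select), the bin count, and the toy values are hypothetical and chosen only for this sketch.

```python
import math


def histogram(scores, bins=8):
    """Normalised histogram of positive-class posteriors in [0, 1]."""
    counts = [0] * bins
    for s in scores:
        # Clamp so that a score of exactly 1.0 falls into the last bin.
        counts[min(int(s * bins), bins - 1)] += 1
    total = len(scores)
    return [c / total for c in counts]


def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)))


def ds_select(prevalences, train_hists, test_hist, p_metric=0.25):
    """Keep estimates from the models whose training histogram is closest to the test one."""
    n_keep = max(1, int(len(prevalences) * p_metric))
    order = sorted(range(len(prevalences)), key=lambda i: hellinger(train_hists[i], test_hist))
    return [prevalences[i] for i in order[:n_keep]]


test_hist = histogram([0.1, 0.2, 0.8, 0.9])
train_hists = [
    histogram([0.1, 0.2, 0.8, 0.9]),      # same shape as the test posteriors
    histogram([0.9, 0.95, 0.85, 0.99]),   # mostly positive
    histogram([0.05, 0.1, 0.15, 0.9]),    # mostly negative
    histogram([0.5, 0.5, 0.5, 0.5]),      # uninformative
]
prevalences = [0.5, 0.9, 0.3, 0.6]
selected = ds_select(prevalences, train_hists, test_hist, p_metric=0.25)
```

With four models and p_metric=0.25, only the estimate from the closest model is kept.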
- fit(X, y)
Fits the ensemble model to the given training data.
- Parameters:
- X : array-like of shape (n_samples, n_features)
The input data.
- y : array-like of shape (n_samples,)
The target values.
- Returns:
- self : Ensemble
The fitted ensemble model.
- get_metadata_routing()
Get metadata routing of this object.
Please check User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- predict(X)
Predicts the class prevalences for the given test data.
- Parameters:
- X : array-like of shape (n_samples, n_features)
The input data.
- Returns:
- prevalences : array-like of shape (n_samples, n_classes)
The predicted class prevalences.
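The final aggregation controlled by return_type can be pictured as a class-wise mean or median over the per-model estimates. This is a minimal standard-library sketch; the function name aggregate and the toy estimates are assumptions for illustration and do not reproduce the library's internals.

```python
from statistics import mean, median


def aggregate(prevalences, return_type="mean"):
    """Combine per-model prevalence vectors class by class."""
    if return_type not in ("mean", "median"):
        raise ValueError("return_type must be 'mean' or 'median'")
    combine = mean if return_type == "mean" else median
    # zip(*prevalences) yields one tuple of estimates per class.
    return [combine(column) for column in zip(*prevalences)]


# Three models, two classes each (toy values).
estimates = [[0.40, 0.60], [0.50, 0.50], [0.90, 0.10]]
mean_prev = aggregate(estimates, "mean")
median_prev = aggregate(estimates, "median")
```

The median is less sensitive to an outlying model such as the third one here, although a class-wise median is not guaranteed to sum to one in general.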
- ptr_selection_metric(prevalences)
Selects the prevalence estimates from models trained on samples whose prevalence is most similar to an initial approximation of the test prevalence as estimated by all models in the ensemble.
- Parameters:
- prevalences : numpy.ndarray
An array of prevalence estimates provided by each model in the ensemble.
- Returns:
- numpy.ndarray
The selected prevalence estimates after applying the PTR selection metric.
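The selection rule above can be sketched for the binary case as follows. The function ptr_select, the explicit train_prevalences argument, and the use of squared error as the distance are assumptions made for this example; the actual method may use a different error measure and stores the training prevalences internally.

```python
def ptr_select(prevalences, train_prevalences, p_metric=0.25):
    """Keep estimates from models whose training prevalence is closest
    to the ensemble-wide initial estimate of the test prevalence."""
    n_models = len(prevalences)
    # Initial approximation: average positive-class estimate over all models.
    initial = sum(prevalences) / n_models
    # Assumption: squared error between training prevalence and that estimate.
    errors = [(train - initial) ** 2 for train in train_prevalences]
    n_keep = max(1, int(n_models * p_metric))
    order = sorted(range(n_models), key=lambda i: errors[i])
    return [prevalences[i] for i in order[:n_keep]]


prevalences = [0.30, 0.35, 0.40, 0.55]        # per-model test estimates
train_prevalences = [0.10, 0.38, 0.45, 0.90]  # prevalence of each training sample
selected = ptr_select(prevalences, train_prevalences, p_metric=0.5)
```

Here the initial estimate is about 0.4, so the two models trained at prevalences 0.38 and 0.45 are kept.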
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.
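The <component>__<parameter> convention is inherited from scikit-learn and is easiest to see on a Pipeline. The sketch below uses only scikit-learn objects; whether a given Ensemble instance exposes nested names in the same way depends on its base quantifier, so the step names here are purely illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two named steps: "scale" and "clf" (names chosen for this example).
pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Update the C parameter of the nested "clf" step through the outer object.
pipeline.set_params(clf__C=0.5)

# get_params(deep=True) reports nested parameters under the same naming scheme.
params = pipeline.get_params(deep=True)
```

Because set_params returns the estimator itself, calls can be chained, for example pipeline.set_params(clf__C=0.5).fit(X, y).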