EnsembleQ
- class mlquantify.meta.EnsembleQ(quantifier, size=50, min_prop=0.1, max_prop=1, selection_metric='all', protocol='uniform', p_metric=0.25, return_type='mean', max_sample_size=None, max_trials=100, n_jobs=1, verbose=False)
Ensemble-based Quantifier combining multiple models trained on varied data samples with controlled prevalence distributions to improve robustness and accuracy.
This quantifier builds an ensemble of quantification models, each trained on a batch of training data sampled according to an evaluation protocol (e.g. ‘artificial’, ‘natural’, ‘uniform’, ‘kraemer’) under the specified prevalence constraints. At prediction time, the members’ prevalence estimates are filtered by the chosen selection metric and then aggregated (mean or median).
- Parameters:
- quantifier : BaseQuantifier
The base quantifier used to build each ensemble member.
- size : int, default=50
Number of ensemble members (sub-models) to train.
- min_prop, max_prop : float, default=(0.1, 1.0)
Minimum and maximum class prevalence proportions for generating training batches.
- selection_metric : {‘all’, ‘ptr’, ‘ds’}, default=‘all’
Metric used to select ensemble members during aggregation:
- ‘all’: uses all models equally.
- ‘ptr’: selects models whose training prevalences are closest to an initial estimate of the test prevalence.
- ‘ds’: selects models whose score distributions are most similar to that of the test data.
- p_metric : float, default=0.25
Proportion of ensemble members to select according to the selection metric.
- protocol : {‘artificial’, ‘natural’, ‘uniform’, ‘kraemer’}, default=‘uniform’
Sampling protocol used to generate training data for ensemble models.
- return_type : {‘mean’, ‘median’}, default=‘mean’
Aggregation method for ensemble predictions.
- max_sample_size : int or None, optional
Maximum number of samples per training batch; defaults to dataset size if None.
- max_trials : int, default=100
Maximum number of trials for sampling.
- n_jobs : int, default=1
Number of parallel jobs for training ensemble members.
- verbose : bool, default=False
Enable verbose output.
- Attributes:
- models : list
List of fitted quantifier ensemble members.
- train_prevalences : list
List of training prevalences corresponding to ensemble members.
- train_distributions : list
List of historical training posterior histograms (used when selection_metric=‘ds’).
- posteriors_generator : callable or None
Function to generate posterior probabilities for new samples.
Notes
Ensemble diversity is controlled by sampling prevalences from the specified protocol.
The ‘ds’ selection metric requires probabilistic quantifiers and computes distribution similarity.
Uses sklearn’s LogisticRegression and GridSearchCV internally for posterior computation within ‘ds’.
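The prevalence-controlled sampling described in these notes can be sketched roughly as follows. This is a simplified illustration for the binary case, not the library's implementation: the helper names `sample_batch_indices` and `draw_training_batches` are hypothetical, prevalences are drawn uniformly between `min_prop` and `max_prop`, and sampling is done with replacement (an assumption of this sketch).

```python
import random


def sample_batch_indices(y, positive_prevalence, batch_size, rng):
    """Draw indices so that about `positive_prevalence` of the batch is class 1."""
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label != 1]
    n_pos = round(positive_prevalence * batch_size)
    n_neg = batch_size - n_pos
    # Sampling with replacement lets a small class still reach the target share.
    return rng.choices(pos, k=n_pos) + rng.choices(neg, k=n_neg)


def draw_training_batches(y, size, min_prop, max_prop, batch_size, seed=0):
    """Return `size` pairs of (target prevalence, sampled indices)."""
    rng = random.Random(seed)
    batches = []
    for _ in range(size):
        # Hypothetical stand-in for a protocol: uniform draw of the target prevalence.
        prevalence = rng.uniform(min_prop, max_prop)
        indices = sample_batch_indices(y, prevalence, batch_size, rng)
        batches.append((prevalence, indices))
    return batches


y = [0] * 80 + [1] * 20
batches = draw_training_batches(y, size=5, min_prop=0.1, max_prop=0.9, batch_size=50)
```

Each ensemble member would then be fitted on one of these batches, so that the members together cover a range of training prevalences.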
References
[1] Pérez-Gállego, P., Castaño, A., Quevedo, J. R., & del Coz, J. J. (2019). Dynamic ensemble selection for quantification tasks. Information Fusion, 45, 1-15. https://doi.org/10.1016/j.inffus.2018.01.001
[2] Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017). Using ensembles for problems with characterizable changes in data distribution: A case study on quantification. Information Fusion, 34, 87-100. https://doi.org/10.1016/j.inffus.2016.07.001
Examples
>>> from mlquantify.meta import EnsembleQ
>>> from mlquantify.mixture import DyS
>>> from sklearn.ensemble import RandomForestClassifier
>>>
>>> # X_train, y_train and X_test are assumed to be defined already
>>> ensemble = EnsembleQ(
...     quantifier=DyS(RandomForestClassifier()),
...     size=30,
...     protocol='artificial',  # APP protocol
...     selection_metric='ptr'
... )
>>> ensemble.fit(X_train, y_train)
>>> prevalence_estimates = ensemble.predict(X_test)
- ds_get_posteriors(X, y)
Generate posterior probabilities using cross-validated logistic regression. This method computes posterior probabilities for the training data via cross-validation, using a logistic regression classifier with hyperparameters optimized through grid search. It also returns a function to generate posterior probabilities for new data.
- Parameters:
- X : array-like of shape (n_samples, n_features)
The feature matrix representing the training data.
- y : array-like of shape (n_samples,)
The target vector representing class labels for the training data.
- Returns:
- posteriors : ndarray of shape (n_samples, n_classes)
Posterior probabilities for the training data obtained through cross-validation.
- posteriors_generator : callable
A function that computes posterior probabilities for new input data.
Notes
- In scenarios where the quantifier is not based on a probabilistic classifier, it is necessary to train a separate probabilistic model to obtain posterior probabilities.
- Using cross-validation avoids the optimistic bias of in-sample predictions for the training data, as each data point is scored by a model that was not trained on that point.
- Hyperparameters for the logistic regression classifier are optimized using a grid search with cross-validation to improve the model’s performance.
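The procedure in these notes could look roughly like the sketch below, built on scikit-learn's `GridSearchCV` and `cross_val_predict`. The function name `cross_validated_posteriors`, the grid over `C`, and the number of folds are assumptions made for illustration; the library's actual grid and settings may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_predict


def cross_validated_posteriors(X, y, cv=5):
    """Return out-of-fold posteriors for (X, y) and a callable for new data."""
    # The grid values are illustrative assumptions, not the library's grid.
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        cv=cv,
    )
    search.fit(X, y)
    best = search.best_estimator_
    # Out-of-fold predictions: each row is scored by a model that never saw it.
    posteriors = cross_val_predict(best, X, y, cv=cv, method="predict_proba")
    # best_estimator_ was refit on all of (X, y) because refit=True by default.
    return posteriors, best.predict_proba


rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(60, 2)),
               rng.normal(1.0, 1.0, size=(60, 2))])
y = np.array([0] * 60 + [1] * 60)

posteriors, posteriors_generator = cross_validated_posteriors(X, y)
new_posteriors = posteriors_generator(rng.normal(0.0, 1.0, size=(10, 2)))
```

Returning the bound `predict_proba` method mirrors the documented pair of return values: one array for the training data and one callable for unseen samples.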
- ds_selection_metric(X, prevalences, train_distributions, posteriors_generator)
Selects the prevalence estimates from models trained on samples whose distribution of posterior probabilities is most similar to the distribution of posterior probabilities for the test data.
- Parameters:
- X : array-like of shape (n_samples, n_features)
The feature matrix representing the test data.
- prevalences : numpy.ndarray
An array of prevalence estimates provided by each model in the ensemble.
- train_distributions : list
Training posterior histograms for the ensemble members.
- posteriors_generator : callable
Function that computes posterior probabilities for new samples.
- Returns:
- numpy.ndarray
The selected prevalence estimates after applying the DS selection metric.
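As a rough illustration of distribution-similarity selection, the sketch below compares histograms of positive-class posteriors with the Hellinger distance and keeps the closest `p_metric` fraction of members. The choice of Hellinger distance, the number of bins, and the helper names are assumptions of this sketch. For brevity it takes the test posteriors directly rather than `X` and `posteriors_generator`.

```python
import numpy as np


def posterior_histogram(positive_posteriors, bins=8):
    """Normalized histogram of positive-class posteriors over [0, 1]."""
    counts, _ = np.histogram(positive_posteriors, bins=bins, range=(0.0, 1.0))
    return counts / counts.sum()


def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))


def select_by_distribution(prevalences, train_distributions, test_posteriors,
                           p_metric=0.25):
    """Keep estimates from members whose training histogram is closest to the test one."""
    test_hist = posterior_histogram(test_posteriors)
    distances = np.array([hellinger(h, test_hist) for h in train_distributions])
    n_keep = max(1, round(p_metric * len(prevalences)))
    keep = np.argsort(distances)[:n_keep]
    return prevalences[keep]


rng = np.random.default_rng(0)
# Four hypothetical members with differently skewed training posteriors.
train_distributions = [
    posterior_histogram(rng.beta(a, b, size=500))
    for a, b in [(1, 5), (2, 4), (4, 2), (5, 1)]
]
prevalences = np.array([[0.8, 0.2], [0.6, 0.4], [0.4, 0.6], [0.2, 0.8]])
test_posteriors = rng.beta(5, 1, size=500)

selected = select_by_distribution(prevalences, train_distributions, test_posteriors)
```

With `p_metric=0.25` and four members, a single member is kept; the selected rows would then be passed on to the mean or median aggregation.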
- fit(X, y)
Fits the ensemble model to the given training data.
- Parameters:
- X : array-like of shape (n_samples, n_features)
The input data.
- y : array-like of shape (n_samples,)
The target values.
- Returns:
- self : EnsembleQ
The fitted ensemble model.
- get_metadata_routing()
Get metadata routing of this object.
Please check the User Guide on how the routing mechanism works.
- Returns:
- routing : MetadataRequest
A MetadataRequest encapsulating routing information.
- get_params(deep=True)
Get parameters for this estimator.
- Parameters:
- deep : bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns:
- params : dict
Parameter names mapped to their values.
- predict(X)
Predicts the class prevalences for the given test data.
- Parameters:
- X : array-like of shape (n_samples, n_features)
The input data.
- Returns:
- prevalences : array-like of shape (n_samples, n_classes)
The predicted class prevalences.
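The final aggregation step governed by `return_type` can be pictured with the small sketch below. The function `aggregate` is hypothetical, and the renormalization at the end is an assumption of this sketch (a per-class median does not always sum to 1); the library may handle this differently.

```python
import numpy as np


def aggregate(member_prevalences, return_type="mean"):
    """Combine per-member prevalence vectors into one estimate."""
    member_prevalences = np.asarray(member_prevalences, dtype=float)
    if return_type == "mean":
        combined = member_prevalences.mean(axis=0)
    elif return_type == "median":
        combined = np.median(member_prevalences, axis=0)
    else:
        raise ValueError(f"unknown return_type: {return_type!r}")
    # Renormalize so the result is a valid prevalence vector (sketch assumption).
    return combined / combined.sum()


# Rows are ensemble members, columns are classes.
member_prevalences = [[0.70, 0.30], [0.60, 0.40], [0.20, 0.80]]
mean_estimate = aggregate(member_prevalences, "mean")
median_estimate = aggregate(member_prevalences, "median")
```

The median is less sensitive than the mean to a single member whose estimate is far from the rest, as with the third row here.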
- ptr_selection_metric(prevalences, train_prevalences)
Selects the prevalence estimates from models trained on samples whose prevalence is most similar to an initial approximation of the test prevalence as estimated by all models in the ensemble.
- Parameters:
- prevalences : numpy.ndarray
An array of prevalence estimates provided by each model in the ensemble.
- train_prevalences : list
Training prevalences corresponding to the ensemble members.
- Returns:
- numpy.ndarray
The selected prevalence estimates after applying the PTR selection metric.
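A minimal sketch of this selection, assuming the initial approximation is the mean of all members' estimates and that closeness is measured by mean squared error (both are assumptions here, as is the helper name `select_by_training_prevalence`):

```python
import numpy as np


def select_by_training_prevalence(prevalences, train_prevalences, p_metric=0.25):
    """Keep estimates from members trained near the initial test-prevalence guess."""
    prevalences = np.asarray(prevalences, dtype=float)
    train_prevalences = np.asarray(train_prevalences, dtype=float)
    # Initial approximation of the test prevalence: average over all members.
    initial_estimate = prevalences.mean(axis=0)
    # Squared error between each training prevalence and that approximation.
    errors = ((train_prevalences - initial_estimate) ** 2).mean(axis=1)
    n_keep = max(1, round(p_metric * len(prevalences)))
    keep = np.argsort(errors)[:n_keep]
    return prevalences[keep]


# Four hypothetical members: training prevalences and their test estimates.
train_prevalences = [[0.9, 0.1], [0.7, 0.3], [0.5, 0.5], [0.3, 0.7]]
prevalences = [[0.75, 0.25], [0.70, 0.30], [0.60, 0.40], [0.55, 0.45]]

selected = select_by_training_prevalence(prevalences, train_prevalences, p_metric=0.5)
```

Here the initial approximation is (0.65, 0.35), so the two members trained at prevalences (0.7, 0.3) and (0.5, 0.5) are kept.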
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as Pipeline). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters:
- **params : dict
Estimator parameters.
- Returns:
- self : estimator instance
Estimator instance.