3.1. Ensemble for Quantification
Ensembles for Quantification (EnsembleQ) are a class of algorithms that improve the accuracy and robustness of class prevalence estimation by combining multiple base quantifiers, each trained on a data sample with a controlled prevalence distribution. Training subsets that simulate different class distributions introduce diversity into the ensemble, which helps it cope with changes in class priors between training and test (Prior Probability Shift, also known as Label Shift).
The algorithm can be divided into three main phases:
Sampling: multiple training subsets with varied prevalence \(p_j\) are drawn according to a sampling protocol (‘artificial’, ‘natural’, ‘uniform’, or ‘kraemer’).
Training: each subset is used to train a base quantifier independently, with parameters estimated via cross-validation.
Aggregation: every model predicts a prevalence \(\hat{p}_j\), and the predictions are aggregated via mean or median, with an optional member-selection policy (‘all’, ‘ptr’, ‘ds’). A minimal sketch of this loop follows the list.
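As a reference point, here is a minimal sketch of the three-phase loop in plain Python. It is an illustration under simplifying assumptions, not the library's implementation: labels are binary in {0, 1}, \(\alpha\) is drawn uniformly (a stand-in for the protocols above), and the base quantifier is assumed to expose fit/predict, with predict returning an array of class prevalences.

import numpy as np

def ensemble_quantify(X, y, X_test, make_quantifier, size=30, seed=0):
    # Illustrative sketch only; not mlquantify's actual EnsembleQ implementation.
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    n = min(len(pos), len(neg))
    predictions = []
    for _ in range(size):
        # Phase 1: draw a batch whose positive prevalence is a random alpha
        alpha = rng.uniform(0.05, 0.95)
        n_pos = int(round(alpha * n))
        idx = np.concatenate([rng.choice(pos, n_pos, replace=True),
                              rng.choice(neg, n - n_pos, replace=True)])
        # Phase 2: train an independent base quantifier on the batch
        quantifier = make_quantifier()
        quantifier.fit(X[idx], y[idx])
        # Phase 3a: record this member's prevalence estimate for the test set
        predictions.append(quantifier.predict(X_test))
    # Phase 3b: aggregate all members (the 'all' policy) with the median
    return np.median(np.asarray(predictions), axis=0)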
Advantages include reduced estimation risk, smoothing of instabilities in individual base quantifiers, and resilience to test prevalences that differ widely from those seen in training.
Mathematical Definition
Given training class-conditional feature distributions \(p(x|+)\) and \(p(x|-)\) and an unlabeled test set \(U\), each training batch simulates a mixture distribution:

\[p_j(x) = \alpha_j \, p(x|+) + (1 - \alpha_j) \, p(x|-)\]

A diverse set of prevalence values \(\alpha_j\) is sampled according to the chosen protocol to generate the training batches \(D_j\), and each base quantifier is trained on one of these batches.
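To make the protocol idea concrete, the sketch below draws prevalence vectors uniformly at random from the probability simplex via sorted uniform spacings, which is the spirit of the ‘kraemer’ protocol; the function name and defaults are illustrative assumptions, not the library's API.

import numpy as np

def sample_prevalences(n_samples, n_classes, seed=0):
    # Kraemer-style sampling: sort n_classes - 1 uniform draws in [0, 1] and
    # take the gaps between consecutive values; each row of gaps sums to 1 and
    # is uniformly distributed over the simplex.
    rng = np.random.default_rng(seed)
    u = np.sort(rng.uniform(size=(n_samples, n_classes - 1)), axis=1)
    bounds = np.hstack([np.zeros((n_samples, 1)), u, np.ones((n_samples, 1))])
    return np.diff(bounds, axis=1)

alphas = sample_prevalences(n_samples=30, n_classes=2)  # one prevalence per batch D_j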
The final ensemble prevalence estimate \(\hat{p}_{\text{final}}\) is computed as:

\[\hat{p}_{\text{final}} = \operatorname{agg}\left(\{\hat{p}_j\}_{j \in S}\right)\]

where the aggregation \(\operatorname{agg}\) is typically the mean or the median, \(S\) is the set of ensemble members retained by the selection policy, and members may optionally be weighted by selection metrics.
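As a worked example, suppose five ensemble members predict positive-class prevalences \(\hat{p}_j\) of 0.22, 0.24, 0.25, 0.27, and 0.58. The median yields 0.25 and discounts the unstable outlier, whereas the mean (0.312) is pulled toward it; this is the instability correction mentioned among the advantages above.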
Selection policies used during aggregation:
‘all’: Uses all ensemble members equally without any selection or weighting.
‘ptr’ (Prevalence Training Ratio): Selects models whose training prevalence \(p_j\) is closest to an initial prevalence estimate of the test set, often computed as the mean of all base predictions (see the sketch after this list).
‘ds’ (Distribution Similarity): Selects models whose training posterior score distributions are most similar to the test set distribution, measured with metrics such as Hellinger Distance. This requires probabilistic quantifiers capable of producing posterior probabilities.
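As an illustration of the ‘ptr’ policy, the following hedged sketch selects the k members whose training prevalence is closest to an initial estimate obtained as the mean of all base predictions; the helper name and the choice of Euclidean distance are assumptions, not mlquantify's API.

import numpy as np

def select_ptr(train_prevalences, test_predictions, k=5):
    # Illustrative sketch only; not mlquantify's actual 'ptr' implementation.
    # Initial estimate of the test prevalence: mean over all members' predictions.
    initial = np.mean(test_predictions, axis=0)
    # Distance from each member's training prevalence p_j to the initial estimate.
    dists = np.linalg.norm(np.asarray(train_prevalences) - initial, axis=1)
    # Keep the k members trained under the most similar prevalence conditions.
    keep = np.argsort(dists)[:k]
    # Aggregate only the selected members (here, with the mean).
    return np.mean(np.asarray(test_predictions)[keep], axis=0)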
Example
from mlquantify.ensemble import EnsembleQ
from mlquantify.mixture import DyS
from sklearn.ensemble import RandomForestClassifier

# Ensemble of 30 DyS base quantifiers, each trained on a batch generated with
# the 'artificial' prevalence protocol; members are chosen with the 'ptr' policy.
ensemble = EnsembleQ(
    quantifier=DyS(RandomForestClassifier()),
    size=30,
    protocol='artificial',
    selection_metric='ptr'
)

# X_train, y_train, X_test are the usual feature matrices and label vector.
ensemble.fit(X_train, y_train)
prevalence_estimates = ensemble.predict(X_test)