2.1. Mixture Models for Non-Aggregative Quantification#
Currently, the only Mixture Model method specifically designed for non-aggregative quantification is HDx (Hellinger Distance x-Similarity), described below.
2.1.1. HDx: Hellinger Distance x-Similarity#
HDx is a non-aggregative quantification method based on HDy [1]. While HDy operates on posterior probabilities (y-space), HDx works directly in the feature space (x-space), without aggregating predictions.
The goal of HDx is to estimate the prevalence \(\alpha\) that minimizes the average Hellinger Distance between the empirical feature distribution of the test set and a convex mixture of the class-conditional feature distributions estimated from the training data.
Mathematical Definition
The mixture distribution for each feature is defined as
\[
V_\alpha(x) = \alpha \, p(x|+) + (1 - \alpha) \, p(x|-)
\]
and HDx estimates the prevalence as
\[
\hat{\alpha} = \underset{\alpha \in [0, 1]}{\arg\min} \; \frac{1}{n_f} \sum_{f=1}^{n_f} HD_f\big(V_\alpha, U\big),
\qquad
HD_f(V_\alpha, U) = \sqrt{\sum_{i=1}^{b} \left( \sqrt{\alpha \, \frac{|S^+_{f,i}|}{\sum_j |S^+_{f,j}|} + (1 - \alpha) \, \frac{|S^-_{f,i}|}{\sum_j |S^-_{f,j}|}} - \sqrt{\frac{|U_{f,i}|}{\sum_j |U_{f,j}|}} \right)^2 }
\]
where:
\(V_\alpha(x)\): mixture distribution for feature \(x\) parameterized by \(\alpha\),
\(p(x|+), p(x|-)\): class-conditional feature distributions from training,
\(HD_f\): Hellinger distance computed on the histogram of feature \(f\),
\(U\): empirical test distribution,
\(|S^+_{f,i}|\), \(|S^-_{f,i}|\): counts of positive/negative training samples in bin \(i\) of feature \(f\),
\(|U_{f,i}|\): count of test samples in bin \(i\) of feature \(f\),
\(b\): number of bins per feature,
\(n_f\): number of features.
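The following is a minimal NumPy sketch of this search, written only to illustrate the definition above; the helper names (hellinger, hdx_estimate), the fixed \(\alpha\) grid, and the single bin size are illustrative assumptions, not part of mlquantify's implementation.

import numpy as np

def hellinger(p, q):
    # Hellinger distance between two binned distributions
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hdx_estimate(X_train, y_train, X_test, n_bins=10):
    n_features = X_train.shape[1]
    pos_hists, neg_hists, test_hists = [], [], []
    for f in range(n_features):
        # shared bin edges so the train and test histograms are comparable
        edges = np.histogram_bin_edges(X_train[:, f], bins=n_bins)
        pos, _ = np.histogram(X_train[y_train == 1, f], bins=edges)
        neg, _ = np.histogram(X_train[y_train == 0, f], bins=edges)
        test, _ = np.histogram(X_test[:, f], bins=edges)
        pos_hists.append(pos / pos.sum())     # p(x|+) for feature f
        neg_hists.append(neg / neg.sum())     # p(x|-) for feature f
        test_hists.append(test / test.sum())  # U restricted to feature f

    # grid search over alpha, keeping the value with the lowest average HD_f
    alphas = np.linspace(0.0, 1.0, 101)
    scores = [
        np.mean([hellinger(a * p + (1 - a) * n, u)  # HD_f(V_alpha, U)
                 for p, n, u in zip(pos_hists, neg_hists, test_hists)])
        for a in alphas
    ]
    return alphas[np.argmin(scores)]  # estimated prevalence of the positive class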
Unlike HDy, HDx does not require a learner to estimate posterior probabilities: it operates directly in the feature space and therefore has no aggregate method.
from mlquantify.mixture import HDx

# No learner is needed: HDx works directly on the feature histograms
q = HDx(bins_size=[10, 20, 30])
q.fit(X_train, y_train)            # estimate the class-conditional histograms
prevalences = q.predict(X_test)    # estimate class prevalences on the test set
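If labels for the test sample happen to be available, the estimate can be sanity-checked against the true prevalence; in this sketch y_test is an assumed variable with binary 0/1 labels, and the exact format of the value returned by predict may differ.

import numpy as np

estimated = q.predict(X_test)            # estimated class prevalences
true_prevalence = (y_test == 1).mean()   # true prevalence of the positive class
print(estimated, true_prevalence)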