2.1. Mixture Models for Non-Aggregative Quantification#

Currently, the only Mixture Model method specifically designed for non-aggregative quantification is HDx (Hellinger Distance x-Similarity), described below.

2.1.1. HDx: Hellinger Distance x-Similarity#

HDx is a non-aggregative quantification method based on HDy [1]. While HDy operates on posterior probabilities (y-space), HDx works directly in the feature space (x-space), without aggregating predictions.

The goal of HDx is to estimate the prevalence parameter \(\alpha\) that minimizes the average Hellinger Distance between the empirical feature distribution of the test set and a convex mixture of the class-conditional feature distributions estimated from the training data.

Mathematical Definition
\[V_\alpha(x) = \alpha \cdot p(x|+) + (1 - \alpha) \cdot p(x|-)\]
\[\alpha^* = \underset{0 \leq \alpha \leq 1}{\arg\min}\; \frac{1}{n_f} \sum_{f=1}^{n_f} HD_f(V_\alpha, U)\]
\[\frac{|V_{f,i}|}{|V|} = \frac{|S^+_{f,i}|}{|S^+|} \cdot \alpha + \frac{|S^-_{f,i}|}{|S^-|} \cdot (1 - \alpha)\]

where:

  • \(V_\alpha(x)\): mixture distribution for feature \(x\) parameterized by \(\alpha\),

  • \(p(x|+), p(x|-)\): class-conditional feature distributions from training,

  • \(HD_f\): Hellinger distance for each feature \(f\),

  • \(U\): empirical test distribution,

  • \(|S^+_{f,i}|\), \(|S^-_{f,i}|\): counts of positive/negative training samples in bin \(i\) of feature \(f\),

  • \(n_f\): number of features.
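The optimization above can be illustrated with a minimal NumPy sketch (not the library's implementation): build per-feature histograms of the positive and negative training samples, then grid-search for the \(\alpha\) whose mixture minimizes the average Hellinger distance to the test histograms. The function name `hdx_estimate`, the single bin count, and the grid resolution are illustrative choices, not part of the mlquantify API.

```python
import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete distributions (histograms).
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def hdx_estimate(X_train, y_train, X_test, n_bins=10, grid=101):
    # Grid search over alpha in [0, 1] minimizing the mean per-feature
    # Hellinger distance between the mixture and the test histograms.
    n_features = X_train.shape[1]
    alphas = np.linspace(0.0, 1.0, grid)
    avg_hd = np.zeros(grid)
    for f in range(n_features):
        # Shared bin edges per feature, covering train and test values.
        edges = np.histogram_bin_edges(
            np.concatenate([X_train[:, f], X_test[:, f]]), bins=n_bins)
        pos, _ = np.histogram(X_train[y_train == 1, f], bins=edges)
        neg, _ = np.histogram(X_train[y_train == 0, f], bins=edges)
        test, _ = np.histogram(X_test[:, f], bins=edges)
        pos = pos / pos.sum()   # p(x|+) for this feature
        neg = neg / neg.sum()   # p(x|-) for this feature
        test = test / test.sum()  # empirical test distribution U
        for i, a in enumerate(alphas):
            mix = a * pos + (1 - a) * neg  # V_alpha for this feature
            avg_hd[i] += hellinger(mix, test) / n_features
    return alphas[np.argmin(avg_hd)]
```

mlquantify's HDx averages over several bin sizes (see `bins_size` in the usage example below); the sketch uses a single bin count to keep the search loop easy to follow.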

Unlike HDy, HDx does not require a learner to estimate posterior probabilities, since it operates directly in the feature space; consequently, it does not provide an aggregate method.

from mlquantify.mixture import HDx

# No underlying classifier is needed: HDx works on the features directly
q = HDx(bins_size=[10, 20, 30])
q.fit(X_train, y_train)
prevalences = q.predict(X_test)
References#