.. _mixture_models_non_agg: .. currentmodule:: mlquantify.mixture ======================================================== Mixture Models for Non-Aggregative Quantification ======================================================== Currently, the only Mixture Model method specifically designed for non-aggregative quantification is **HDx (Hellinger Distance x-Similarity)**, found at :class:`HDx`. HDx: Hellinger Distance x-Similarity ==================================== **HDx** is a non-aggregative quantification method based on :class:`HDy` [1]_. While HDy operates on posterior probabilities (y-space), HDx works directly in the feature space (x-space), without aggregating predictions. The goal of HDx is to estimate the prevalence parameter \(\alpha\) that minimizes the average Hellinger Distance between the empirical feature distribution of the test set and a convex mixture of the class-conditional feature distributions from training data. .. dropdown:: Mathematical Definition .. math:: V_\alpha(x) = \alpha \cdot p(x|+) + (1 - \alpha) \cdot p(x|-) .. math:: \alpha^* = \underset{0 \leq \alpha \leq 1}{\arg\min}\; \frac{1}{n_f} \sum_{f=1}^{n_f} HD_f(V^\alpha, U) .. math:: \frac{|V_{f,i}|}{|V|} = \frac{|S^+_{f,i}|}{|S^+|} \cdot \alpha + \frac{|S^-_{f,i}|}{|S^-|} \cdot (1 - \alpha) where: - :math:`V_\alpha(x)`: mixture distribution for feature :math:`x` parameterized by :math:`\alpha`, - :math:`p(x|+), p(x|-)`: class-conditional feature distributions from training, - :math:`HD_f`: Hellinger distance for each feature :math:`f`, - :math:`U`: empirical test distribution, - :math:`|S^+_{f,i}|`, :math:`|S^-_{f,i}|`: counts of positive/negative training samples in bin :math:`i` of feature :math:`f`, - :math:`n_f`: number of features. HDx, different from HDy, does not require a learner to estimate posterior probabilities, as it operates directly in the feature space, so it does not have a `aggregate` method. .. code-block:: python from mlquantify.mixture import HDx from sklearn.ensemble import RandomForestClassifier q = HDx(bins_size=[10, 20, 30]) q.fit(X_train, y_train) q.predict(X_test) .. dropdown:: References .. [1] González-Castro, V., Alaiz-Rodríguez, R., & Alegre, E. (2013). Class distribution estimation based on the Hellinger distance. Information Sciences, 218, 146-164. https://doi.org/10.1016/j.ins.2012.05.028