1.4. Mixture Models#

Mixture Model (MM) methods, often referred to as Distribution Matching (DM) methods, constitute one of the main families of quantification algorithms [1].

Mixture Models seek to model the data distribution observed in the test set as a parametric mixture of the individual class distributions obtained from the training set \(L\).

Note

Mixture Models are predominantly designed for binary quantification problems. While extensions to multi-class scenarios exist, such as one-vs-all strategies, they are computationally intensive and less commonly used. If you are dealing with multi-class quantification, consider using methods from the density_module, which scale better.

Mathematical details - Mixture Formulation

The observed distribution in the test set is approximated as:

\[D_U \approx \hat{p} \cdot D_+ + (1 - \hat{p}) \cdot D_-\]

Unlike methods such as EMQ, MM methods generally do not refine priors and posteriors mutually. Instead, they use a search process (exhaustive or optimized) to find the parameter \(\hat{p}\) that minimizes a dissimilarity function between the mixture and the observed test distribution.
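As a toy illustration of this matching idea (with made-up numbers), suppose the positive training scores place 80% of their mass above a threshold of 0.5, the negative scores place 20% above it, and the test scores place 50% above it. Matching the mixture to the test distribution on that single statistic gives:

\[0.8\,\hat{p} + 0.2\,(1 - \hat{p}) = 0.5 \quad\Rightarrow\quad \hat{p} = 0.5\]

Actual MM methods match richer representations (histograms, CDFs, or means) rather than a single threshold, but the search over \(\hat{p}\) follows the same principle.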

References

1.4.1. DyS: Distribution y-Similarity Framework#

DyS is a generic framework that formalizes the Mixture Models approach. The term y-Similarity indicates that it compares the similarity of classification score distributions (y-space) [1].

DyS depends on two critical factors: the method used to represent the distributions (such as histograms or means) and the dissimilarity function (DS) used to compare the test distribution with the mixture distribution.

DyS seeks the prevalence parameter \(\alpha\) that minimizes the dissimilarity (\(DS\)) between the test score distribution (\(f_U\)) and the mixture of training score distributions weighted by \(\alpha\).

Mathematical details - DyS Optimization

The estimated prevalence is the \(\alpha\) that satisfies:

\[\hat{p}^{DyS}(\oplus) = \alpha^* = \operatorname*{arg\,min}_{0 \le \alpha \le 1} \{ DS(\alpha f_{L^{\oplus}} + (1-\alpha) f_{L^{\ominus}}, f_U) \}\]
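The framework leaves both the representation and the DS function open. As a rough sketch of the search itself (not the library's implementation), the minimization can be carried out by a ternary search over \(\alpha\), assuming the chosen dissimilarity ds is unimodal in \(\alpha\) and that f_pos, f_neg, and f_test are normalized histogram arrays:

import numpy as np

def dys_alpha(f_pos, f_neg, f_test, ds, tol=1e-4):
    # Ternary search for the alpha in [0, 1] that minimizes
    # ds(alpha * f_pos + (1 - alpha) * f_neg, f_test),
    # assuming the dissimilarity is unimodal in alpha.
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        d1 = ds(m1 * f_pos + (1 - m1) * f_neg, f_test)
        d2 = ds(m2 * f_pos + (1 - m2) * f_neg, f_test)
        if d1 > d2:
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

# Toy usage with the L1 distance as DS and hand-made 4-bin histograms.
l1 = lambda a, b: np.abs(a - b).sum()
f_pos = np.array([0.1, 0.2, 0.3, 0.4])
f_neg = np.array([0.4, 0.3, 0.2, 0.1])
f_test = 0.3 * f_pos + 0.7 * f_neg
print(dys_alpha(f_pos, f_neg, f_test, l1))  # approximately 0.3

Any histogram-based dissimilarity can be plugged in as ds; choosing the Hellinger distance recovers HDy, described in the next section.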
References

1.4.2. HDy: Hellinger Distance y-Similarity#

HDy is a specific and popular instance of the DyS framework and a variant of Forman’s original MM, proposed by [1].

What HDy does:

  1. Representation: HDy uses normalized histograms (PDF estimates) of posterior probabilities (y-scores) to represent the training class distributions and the test set distribution.

  2. Mixture: It models the test histogram (\(Q\)) as a mixture of the positive histogram (\(P_+\)) and the negative histogram (\(P_-\)), weighted by the parameter \(\hat{p}\).

  3. Comparison: HDy uses the Hellinger Distance (HD) as the dissimilarity metric to find the value \(\hat{p}\) that minimizes the distance between the mixture and the test distribution.

Example

from mlquantify.mixture import HDy
from sklearn.ensemble import RandomForestClassifier

# The learner provides the posterior scores; `bins` sets the histogram resolution.
q = HDy(learner=RandomForestClassifier(), bins=10)
q.fit(X_train, y_train)
q.predict(X_test)  # estimated class prevalences for the test sample
Mathematical details - HDy Bin Adjustment

The bin-level fit for the histogram is given by:

\[\frac{|D'_i|}{|D'|} = \frac{|D^+_i|}{|D^+|} \cdot \hat{p} + \frac{|D^-_i|}{|D^-|} \cdot (1 - \hat{p})\]

Where \(|D'|\) and \(|D'_i|\) are, respectively, the total cardinality and the count in bin \(i\) for the modified training distribution [1][2].
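To make the bin-level matching concrete, here is a minimal sketch (purely illustrative, not the library's internal code), assuming pos_scores, neg_scores, and test_scores are arrays of positive-class posteriors in [0, 1]:

import numpy as np

def hellinger(p, q):
    # Hellinger distance between two normalized histograms.
    return np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)) / np.sqrt(2)

def hdy_search(pos_scores, neg_scores, test_scores, bins=10):
    # Build normalized score histograms on a shared [0, 1] grid and
    # grid-search the prevalence whose mixture is closest to the test histogram.
    edges = np.linspace(0, 1, bins + 1)
    h_pos = np.histogram(pos_scores, bins=edges)[0] / len(pos_scores)
    h_neg = np.histogram(neg_scores, bins=edges)[0] / len(neg_scores)
    h_test = np.histogram(test_scores, bins=edges)[0] / len(test_scores)
    candidates = np.linspace(0, 1, 101)
    dists = [hellinger(p * h_pos + (1 - p) * h_neg, h_test) for p in candidates]
    return candidates[int(np.argmin(dists))]

# Toy usage with synthetic posteriors (positives skew high, negatives low).
rng = np.random.default_rng(0)
pos, neg = rng.beta(5, 2, 500), rng.beta(2, 5, 500)
test = np.concatenate([rng.beta(5, 2, 300), rng.beta(2, 5, 700)])  # true prevalence 0.3
print(hdy_search(pos, neg, test))

In practice this search is handled internally by the HDy class shown in the example above.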

References

1.4.3. SMM: Sample Mean Matching#

SMM is a member of the DyS framework, proposed by [4] and notable for its simplicity and efficiency; it is located at SMM.

What SMM does:

  1. Representation: Instead of using histograms (like HDy) or CDFs, SMM represents the positive (\(S_{\oplus}\)), negative (\(S_{\ominus}\)), and test (\(S_U\)) score distributions by a single scalar statistic: the mean (\(\mu\)) of the scores.

  2. Optimization: SMM assumes that the mean of the test scores is the weighted sum of the training score means.

  3. Closed Form Solution: SMM does not require iteration or complex search procedures, as the problem can be solved in closed form.

Note

SMM is mathematically equivalent to the PACC (Probabilistic Adjusted Classify & Count) method [4].

Mathematical details - SMM Closed Form

SMM seeks the parameter \(\alpha\) that minimizes the absolute difference between the test mean and the mixture mean:

\[\hat{p}^{SMM}(\oplus) = \alpha = \operatorname*{arg\,min}_{0 \le \alpha \le 1} \{ |\alpha \mu[S_{\oplus}] + (1-\alpha)\mu[S_{\ominus}] - \mu[S_U]| \}\]

This can be solved directly via the formula:

\[\alpha = \frac{\mu[S_U] - \mu[S_{\ominus}]}{\mu[S_{\oplus}] - \mu[S_{\ominus}]}\]

Where:

  • \(\mu[S_U]\) is the mean of test scores.

  • \(\mu[S_{\oplus}]\) is the mean of positive training scores.

  • \(\mu[S_{\ominus}]\) is the mean of negative training scores.
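As a minimal sketch of the closed form (an illustration with synthetic scores, not the library's routine), assuming pos_scores, neg_scores, and test_scores are arrays of positive-class posteriors:

import numpy as np

def smm_prevalence(pos_scores, neg_scores, test_scores):
    # Closed-form SMM: how far the test mean lies between the negative
    # and positive training score means.
    alpha = (test_scores.mean() - neg_scores.mean()) / (pos_scores.mean() - neg_scores.mean())
    # Sampling noise can push the ratio outside [0, 1], so clip it.
    return float(np.clip(alpha, 0.0, 1.0))

# Toy usage with synthetic posterior scores.
rng = np.random.default_rng(0)
pos, neg = rng.beta(5, 2, 500), rng.beta(2, 5, 500)
test = np.concatenate([rng.beta(5, 2, 300), rng.beta(2, 5, 700)])  # true prevalence 0.3
print(smm_prevalence(pos, neg, test))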

Example

from mlquantify.mixture import SMM
from sklearn.linear_model import LogisticRegression

q = SMM(learner=LogisticRegression())
q.fit(X_train, y_train)
q.predict(X_test)
References

1.4.4. SORD: Sample Ordinal Distance#

SORD (Sample Ordinal Distance) is one of the dissimilarity functions that fall under the DyS framework, located at SORD.

SORD is notable for operating directly on score samples (observations) rather than on discretized distributions (histograms). It seeks the minimum cost of transforming the weighted mixture of training score samples into the test score sample, eliminating the dependency on the number of bins and thus providing an alternative that does not lose detail through discretization [5].
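As a rough sketch of this idea (an illustration that treats the mixture and the test sample as signed, weighted point masses on the score axis; sord_cost and the synthetic data below are not part of the library), the cost for a candidate \(\alpha\) can be computed by sorting the pooled scores and accumulating the signed mass across each gap, and the prevalence is then found by searching over candidate values:

import numpy as np

def sord_cost(alpha, pos_scores, neg_scores, test_scores):
    # Mixture samples carry positive weight, test samples negative weight,
    # so mass that is already matched cancels out.
    values = np.concatenate([pos_scores, neg_scores, test_scores])
    weights = np.concatenate([
        np.full(len(pos_scores), alpha / len(pos_scores)),
        np.full(len(neg_scores), (1 - alpha) / len(neg_scores)),
        np.full(len(test_scores), -1.0 / len(test_scores)),
    ])
    order = np.argsort(values)
    values, weights = values[order], weights[order]
    # Cost of moving the remaining signed mass across each gap between
    # consecutive sorted scores (a one-dimensional transport cost).
    return np.sum(np.abs(np.cumsum(weights)[:-1]) * np.diff(values))

# Grid search over candidate prevalences on synthetic posterior scores.
rng = np.random.default_rng(0)
pos, neg = rng.beta(5, 2, 500), rng.beta(2, 5, 500)
test = np.concatenate([rng.beta(5, 2, 300), rng.beta(2, 5, 700)])  # true prevalence 0.3
alphas = np.linspace(0, 1, 101)
print(alphas[int(np.argmin([sord_cost(a, pos, neg, test) for a in alphas]))])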

References