1.1. Aggregative Quantification#

Aggregative quantifiers are a class of quantification methods that aggregate the results of an intermediate task, typically classification: the quantifier uses the predicted values (labels or scores) of a classifier to estimate the class distribution of the test set. These quantifiers can be separated into two main groups, mixture models and threshold methods, but there are also methods that do not fit into either category, such as:

Other Aggregative Quantifiers#

| quantifier | class | reference |
|---|---|---|
| Classify and Count | CC | Forman (2005) |
| Expectation Maximisation for Quantification | EMQ | Saerens et al. (2002) |
| Probabilistic Classify and Count | PCC | Bella et al. (2010) |
| Friedman Method | FM | Friedman |
| Generalized Adjusted Count | GAC | Firat (2008) |
| Generalized Probabilistic Adjusted Count | GPAC | Firat (2008) |
| Nearest-Neighbor based Quantification | PWK | Barranquero et al. (2013) |

An important note: all of the methods listed above are multiclass quantifiers, whereas the mixture models and threshold methods are binary quantifiers.

1.1.1. Classify and Count#

The most basic quantification method is Classify and Count (CC), which simply classifies each instance in the test set and counts how many instances are assigned to each class. It relies on the assumption that the class distribution in the test set is similar to that of the training set, and so it serves mainly as a baseline for comparison with other quantification methods. The method is implemented in the CC class.
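
Below is a minimal sketch of this workflow; it assumes CC is importable from mlquantify.methods and follows the same fit/predict interface as the other quantifiers shown in this section.

from mlquantify.methods import CC
from sklearn.linear_model import LogisticRegression
import numpy as np

X_train = np.random.rand(100, 10)
y_train = np.random.randint(0, 2, size=100)
X_test = np.random.rand(50, 10)

# Classify each test instance, then count the predicted labels per class
quantifier = CC(LogisticRegression())
quantifier.fit(X_train, y_train)
prevalence = quantifier.predict(X_test)
print(prevalence)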

1.1.2. Mixture Models#

Mixture models are binary quantification methods that assume the score distribution of an unknown dataset (the test set) is a mixture of two or more distributions; that is, the test scores are modeled as a weighted combination of the class-conditional score distributions, with the class prevalences as weights. This concept was first introduced by Forman (2005, 2008).

The base structure of a mixture model uses the training-set scores, generated via cross-validation, and combines them to approximate the distribution of the test-set scores. Quantification is performed by estimating the parameters of the mixture from the training scores and then applying the model to the test scores, as sketched below.
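
To make this concrete, here is a schematic NumPy sketch of the core idea (an illustration under simplified assumptions, not the library's implementation): the positive and negative training-score histograms are mixed with a candidate prevalence alpha, and the alpha whose mixture best matches the test-score histogram is returned.

import numpy as np

def mixture_prevalence(pos_scores, neg_scores, test_scores, bins=10):
    # Bin the classifier scores into normalized histograms
    edges = np.linspace(0, 1, bins + 1)
    p_pos, _ = np.histogram(pos_scores, bins=edges, density=True)
    p_neg, _ = np.histogram(neg_scores, bins=edges, density=True)
    p_test, _ = np.histogram(test_scores, bins=edges, density=True)

    best_alpha, best_dist = 0.0, np.inf
    for alpha in np.linspace(0, 1, 101):
        # Mixture of the two class-conditional score distributions
        mixture = alpha * p_pos + (1 - alpha) * p_neg
        dist = np.sum((mixture - p_test) ** 2)  # squared Euclidean distance
        if dist < best_dist:
            best_alpha, best_dist = alpha, dist
    return best_alpha

# Toy usage with synthetic scores: the true positive prevalence is 0.3
pos = np.random.beta(5, 2, size=500)
neg = np.random.beta(2, 5, size=500)
test = np.concatenate([np.random.beta(5, 2, 300), np.random.beta(2, 5, 700)])
print(mixture_prevalence(pos, neg, test))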

The library implements the following mixture models:

Implemented Mixture Models#

| quantifier | class | reference |
|---|---|---|
| Distribution y-Similarity | DyS | Maletzke et al. (2019) |
| Synthetic Distribution y-Similarity | DySsyn | Maletzke et al. (2021) |
| Hellinger Distance Minimization | HDy | González-Castro et al. (2013) |
| Sample Mean Matching | SMM | Hassan et al. (2013) |
| Sample Ordinal Distance | SORD | Maletzke et al. (2019) |

Some algorithms, such as DyS, DySsyn, and HDy, are based on distances between the mixture of the training scores and the test scores. HDy, for example, uses the Hellinger distance to measure the difference between the two distributions. All the distances can be accessed through the mlquantify.utils.method module, which implements four different distance functions.
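
As an illustration of one such distance, a direct NumPy implementation of the Hellinger distance between two binned score distributions could look like the sketch below (shown for clarity only; the library ships its own distance functions in mlquantify.utils.method).

import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete distributions p and q
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

print(hellinger([0.2, 0.5, 0.3], [0.1, 0.6, 0.3]))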

These methods also provide a best_distance method, which returns the best distance computed by the method. Below is an example of how to use it:

from mlquantify.methods import DyS
from sklearn.linear_model import LogisticRegression
import numpy as np

# Synthetic data: 100 training and 50 test instances with 10 features
X_train = np.random.rand(100, 10)
y_train = np.random.randint(0, 2, size=100)
X_test = np.random.rand(50, 10)

quantifier = DyS(LogisticRegression())
quantifier.fit(X_train, y_train)

# Best distance between the mixed training scores and the test scores
distance = quantifier.best_distance(X_test)
print(distance)

1.1.3. Threshold Methods#

The threshold methods are also binary quantifiers (i.e., multiclass variants have not been implemented yet). Proposed by Forman (2005, 2008), these algorithms work by adjusting the outputs of a classifier to obtain the class distribution of the test set. Most methods use a table of thresholds (e.g., 0.0, 0.1, 0.2, …, 1.0) along with the TPR (True Positive Rate) and FPR (False Positive Rate) at each threshold to estimate the class distribution; each quantifier uses these values in a different way.
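
For instance, the classic Adjusted Count correction behind ACC rescales the raw classify-and-count estimate using the classifier's TPR and FPR. A minimal sketch of that adjustment (illustrative, not the library's code) is:

import numpy as np

def adjusted_count(y_pred, tpr, fpr):
    # Raw classify-and-count estimate of the positive prevalence
    cc = np.mean(y_pred)
    # Forman's adjustment: p = (cc - fpr) / (tpr - fpr), clipped to [0, 1]
    return np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0)

# Toy usage: 60% predicted positive, with TPR = 0.9 and FPR = 0.2
print(adjusted_count(np.array([1] * 60 + [0] * 40), tpr=0.9, fpr=0.2))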

The library implements the following threshold methods:

Implemented Threshold Methods#

| quantifier | class | reference |
|---|---|---|
| Adjusted Classify and Count or Adjusted Count | ACC | Forman (2005) |
| Threshold MAX | MAX | Forman (2008) |
| Median Sweep | MS | Forman (2005) |
| Median Sweep 2 | MS2 | Forman (2005) |
| Probabilistic Adjusted Classify and Count | PACC | Bella et al. (2010) |
| Threshold 50 | T50 | Forman (2005) |
| Threshold X | X_method | Forman (2005) |

You can compute the table of TPR and FPR values for each threshold using the adjust_threshold function. This function takes the true labels and the predicted probabilities of the positive class, generated via cross-validation, and returns the thresholds together with the corresponding TPR and FPR values.

from mlquantify.utils.method import adjust_threshold, get_scores
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np

X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200) # Random binary labels [0, 1]
classes = np.unique(y)

model = LogisticRegression() # Example model, replace with your own

# Generate out-of-fold labels and scores via 10-fold cross-validation
y_labels, probabilities = get_scores(X=X, y=y, learner=model, folds=10, learner_fitted=False)
probabilities = probabilities[:, 1] # Get the probabilities for the positive class

thresholds, tprs, fprs = adjust_threshold(y=y_labels, probabilities=probabilities, classes=classes)

table = pd.DataFrame({
"Threshold": thresholds,
"TPR": tprs,
"FPR": fprs
})

print(table)

Alternatively, you can use compute_table together with the compute_tpr and compute_fpr functions to obtain the TPR and FPR values manually:

from mlquantify.utils.method import compute_table, compute_tpr, compute_fpr
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np

X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200) # Random binary labels [0, 1]
classes = np.unique(y)

model = LogisticRegression()
model.fit(X, y)
y_pred = model.predict(X)

TP, FP, FN, TN = compute_table(y, y_pred, classes)
tpr = compute_tpr(TP, FN)
fpr = compute_fpr(FP, TN)
print("True Positive Rate (TPR):", tpr)
print("False Positive Rate (FPR):", fpr)

1.1.4. Expectation Maximisation for Quantification#

Expectation Maximisation (EM) is an iterative algorithm used to find maximum likelihood estimates of parameters (in our case, the class distribution) for models that depend on unobserved latent variables. The algorithm incrementally updates the posterior probabilities using the class prevalence values computed in the previous iteration, and then updates the class prevalence values using those posterior probabilities, in a mutually recursive fashion.
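
A compact sketch of this iteration, following Saerens et al. (2002) (illustrative, not the EMQ class itself), is shown below; posteriors is the matrix of classifier probabilities on the test set and train_prev holds the training prevalences.

import numpy as np

def em_quantify(posteriors, train_prev, n_iter=100, tol=1e-6):
    prev = train_prev.copy()
    for _ in range(n_iter):
        # E-step: reweight the posteriors by the current prevalence estimate
        weighted = posteriors * (prev / train_prev)
        weighted /= weighted.sum(axis=1, keepdims=True)
        # M-step: the new prevalence is the mean of the updated posteriors
        new_prev = weighted.mean(axis=0)
        if np.abs(new_prev - prev).max() < tol:
            break
        prev = new_prev
    return prev

# Toy usage with two classes and uniform training prevalences
posteriors = np.random.dirichlet([2, 3], size=50)
print(em_quantify(posteriors, train_prev=np.array([0.5, 0.5])))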

The method is available in the EMQ class, and can be used either as a probabilistic classifier, via the predict_proba method, or as a quantifier, as in the example below:

from mlquantify.methods.aggregative import EMQ
from sklearn.linear_model import LogisticRegression
import numpy as np

X_train = np.random.rand(100, 10)
y_train = np.random.randint(0, 2, size=100)
X_test = np.random.rand(50, 10)

quantifier = EMQ(LogisticRegression())
quantifier.fit(X_train, y_train)

class_distribution = quantifier.predict(X_test)  # estimated class prevalences
scores = quantifier.predict_proba(X_test)  # EM-updated posterior probabilities

print("Class distribution:", class_distribution)
print("Scores:", scores)