1.1. Aggregative Quantification#

Aggregative quantifiers are a class of quantification methods that aggregate the results of an intermediate task, typically classification: the quantifier uses the predicted values (labels or scores) of a classifier to estimate the class distribution of the test set. These quantifiers can be separated into two main groups, mixture models and threshold methods, but there are also methods that do not fit into either category, such as:

Other Aggregative Quantifiers#

| quantifier | class | reference |
|---|---|---|
| Classify and Count | CC | Forman (2005) |
| Expectation Maximisation for Quantification | EMQ | Saerens et al. (2002) |
| Probabilistic Classify and Count | PCC | Bella et al. (2010) |
| Friedman Method | FM | Friedman |
| Generalized Adjusted Count | GAC | Firat (2008) |
| Generalized Probabilistic Adjusted Count | GPAC | Firat (2008) |
| Nearest-Neighbor based Quantification | PWK | Barranquero et al. (2013) |

An important note: all of the methods listed above are multiclass quantifiers, whereas the mixture models and threshold methods are binary quantifiers.

1.1.1. Classify and Count#

The most basic quantification method is Classify and Count (CC), which simply classifies each instance in the test set and counts how many instances are assigned to each class. It relies on the assumption that the class distribution in the test set is similar to that of the training set, and so it serves mainly as a baseline for comparison with other quantification methods. The method is implemented in the CC class.
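
Below is a minimal sketch of this workflow; it assumes CC is importable from mlquantify.methods and follows the same fit/predict interface as the other quantifiers shown in this section.

from mlquantify.methods import CC
from sklearn.linear_model import LogisticRegression
import numpy as np

X_train = np.random.rand(100, 10)
y_train = np.random.randint(0, 2, size=100)
X_test = np.random.rand(50, 10)

# Classify each test instance, then count the predicted labels per class
quantifier = CC(LogisticRegression())
quantifier.fit(X_train, y_train)
prevalence = quantifier.predict(X_test)
print(prevalence)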

1.1.2. Mixture Models#

Mixture models are binary quantification methods that assume the score distribution of an unknown dataset (the test set) is a mixture of two or more distributions; that is, the test scores are modeled as a weighted combination of the class-conditional score distributions, with the class prevalences as weights. This concept was first introduced by Forman (2005, 2008).

The base structure of a mixture model uses the training-set scores, generated via cross-validation, and combines them to approximate the distribution of the test-set scores. Quantification is performed by estimating the parameters of the mixture from the training scores and then applying the model to the test scores, as sketched below.
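
To make this concrete, here is a schematic NumPy sketch of the core idea (an illustration under simplified assumptions, not the library's implementation): the positive and negative training-score histograms are mixed with a candidate prevalence alpha, and the alpha whose mixture best matches the test-score histogram is returned.

import numpy as np

def mixture_prevalence(pos_scores, neg_scores, test_scores, bins=10):
    # Bin the classifier scores into normalized histograms
    edges = np.linspace(0, 1, bins + 1)
    p_pos, _ = np.histogram(pos_scores, bins=edges, density=True)
    p_neg, _ = np.histogram(neg_scores, bins=edges, density=True)
    p_test, _ = np.histogram(test_scores, bins=edges, density=True)

    best_alpha, best_dist = 0.0, np.inf
    for alpha in np.linspace(0, 1, 101):
        # Mixture of the two class-conditional score distributions
        mixture = alpha * p_pos + (1 - alpha) * p_neg
        dist = np.sum((mixture - p_test) ** 2)  # squared Euclidean distance
        if dist < best_dist:
            best_alpha, best_dist = alpha, dist
    return best_alpha

# Toy usage with synthetic scores: the true positive prevalence is 0.3
pos = np.random.beta(5, 2, size=500)
neg = np.random.beta(2, 5, size=500)
test = np.concatenate([np.random.beta(5, 2, 300), np.random.beta(2, 5, 700)])
print(mixture_prevalence(pos, neg, test))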

The library implements the following mixture models:

Implemented Mixture Models#

| quantifier | class | reference |
|---|---|---|
| Distribution y-Similarity | DyS | Maletzke et al. (2019) |
| Synthetic Distribution y-Similarity | DySsyn | Maletzke et al. (2021) |
| Hellinger Distance Minimization | HDy | González-Castro et al. (2013) |
| Sample Mean Matching | SMM | Hassan et al. (2013) |
| Sample Ordinal Distance | SORD | Maletzke et al. (2019) |

Some algorithms, such as DyS, DySsyn, and HDy, are based on distances between the mixture of the training scores and the test scores. HDy, for example, uses the Hellinger distance to measure the difference between the two distributions. All the distances can be accessed through the mlquantify.utils.method module, which implements four different distance functions.
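
As an illustration of one such distance, a direct NumPy implementation of the Hellinger distance between two binned score distributions could look like the sketch below (shown for clarity only; the library ships its own distance functions in mlquantify.utils.method).

import numpy as np

def hellinger(p, q):
    # Hellinger distance between two discrete distributions p and q
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

print(hellinger([0.2, 0.5, 0.3], [0.1, 0.6, 0.3]))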

These methods also provide a best_distance method, which returns the best distance computed by the method. Below is an example of how to use it:

from mlquantify.methods import DyS
from sklearn.linear_model import LogisticRegression
import numpy as np

# Synthetic data: 100 training and 50 test instances with 10 features
X_train = np.random.rand(100, 10)
y_train = np.random.randint(0, 2, size=100)
X_test = np.random.rand(50, 10)

quantifier = DyS(LogisticRegression())
quantifier.fit(X_train, y_train)

# Best distance between the mixed training scores and the test scores
distance = quantifier.best_distance(X_test)
print(distance)

1.1.3. Threshold Methods#

The threshold methods are also binary quantifiers (i.e., multiclass variants have not been implemented yet). Proposed by Forman (2005, 2008), these algorithms work by adjusting the outputs of a classifier to obtain the class distribution of the test set. Most methods use a table of thresholds (e.g., 0.0, 0.1, 0.2, …, 1.0) along with the TPR (True Positive Rate) and FPR (False Positive Rate) at each threshold to estimate the class distribution; each quantifier uses these values in a different way.
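
For instance, the classic Adjusted Count correction behind ACC rescales the raw classify-and-count estimate using the classifier's TPR and FPR. A minimal sketch of that adjustment (illustrative, not the library's code) is:

import numpy as np

def adjusted_count(y_pred, tpr, fpr):
    # Raw classify-and-count estimate of the positive prevalence
    cc = np.mean(y_pred)
    # Forman's adjustment: p = (cc - fpr) / (tpr - fpr), clipped to [0, 1]
    return np.clip((cc - fpr) / (tpr - fpr), 0.0, 1.0)

# Toy usage: 60% predicted positive, with TPR = 0.9 and FPR = 0.2
print(adjusted_count(np.array([1] * 60 + [0] * 40), tpr=0.9, fpr=0.2))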

The library implements the following threshold methods:

Implemented Threshold Methods#

| quantifier | class | reference |
|---|---|---|
| Adjusted Classify and Count or Adjusted Count | ACC | Forman (2005) |
| Threshold MAX | MAX | Forman (2008) |
| Median Sweep | MS | Forman (2005) |
| Median Sweep 2 | MS2 | Forman (2005) |
| Probabilistic Adjusted Classify and Count | PACC | Bella et al. (2010) |
| Threshold 50 | T50 | Forman (2005) |
| Threshold X | X_method | Forman (2005) |

You can compute the table of TPR and FPR values for each threshold using the adjust_threshold function. This function takes the true labels and the predicted probabilities of the positive class, generated via cross-validation, and returns the thresholds together with the corresponding TPR and FPR values.

from mlquantify.utils.method import adjust_threshold, get_scores
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np

X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200) # Random binary labels [0, 1]
classes = np.unique(y)

model = LogisticRegression() # Example model, replace with your own

# Generate out-of-fold labels and scores via 10-fold cross-validation
y_labels, probabilities = get_scores(X=X, y=y, learner=model, folds=10, learner_fitted=False)
probabilities = probabilities[:, 1] # Get the probabilities for the positive class

thresholds, tprs, fprs = adjust_threshold(y=y_labels, probabilities=probabilities, classes=classes)

table = pd.DataFrame({
"Threshold": thresholds,
"TPR": tprs,
"FPR": fprs
})

print(table)

Alternatively, you can use compute_table together with the compute_tpr and compute_fpr functions to obtain the TPR and FPR values manually:

from mlquantify.utils.method import compute_table, compute_tpr, compute_fpr
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np

X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200) # Random binary labels [0, 1]
classes = np.unique(y)

model = LogisticRegression()
model.fit(X, y)
y_pred = model.predict(X)

TP, FP, FN, TN = compute_table(y, y_pred, classes)
tpr = compute_tpr(TP, FN)
fpr = compute_fpr(FP, TN)
print("True Positive Rate (TPR):", tpr)
print("False Positive Rate (FPR):", fpr)

1.1.4. Expectation Maximisation for Quantification#

Expectation Maximisation (EM) is an iterative algorithm used to find maximum likelihood estimates of parameters (in our case, the class distribution) for models that depend on unobserved latent variables. The algorithm incrementally updates the posterior probabilities using the class prevalence values computed in the previous iteration, and then updates the class prevalence values using those posterior probabilities, in a mutually recursive fashion.
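
A compact sketch of this iteration, following Saerens et al. (2002) (illustrative, not the EMQ class itself), is shown below; posteriors is the matrix of classifier probabilities on the test set and train_prev holds the training prevalences.

import numpy as np

def em_quantify(posteriors, train_prev, n_iter=100, tol=1e-6):
    prev = train_prev.copy()
    for _ in range(n_iter):
        # E-step: reweight the posteriors by the current prevalence estimate
        weighted = posteriors * (prev / train_prev)
        weighted /= weighted.sum(axis=1, keepdims=True)
        # M-step: the new prevalence is the mean of the updated posteriors
        new_prev = weighted.mean(axis=0)
        if np.abs(new_prev - prev).max() < tol:
            break
        prev = new_prev
    return prev

# Toy usage with two classes and uniform training prevalences
posteriors = np.random.dirichlet([2, 3], size=50)
print(em_quantify(posteriors, train_prev=np.array([0.5, 0.5])))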

The method is available in the EMQ class, and can be used either as a probabilistic classifier, via the predict_proba method, or as a quantifier, as in the example below:

from mlquantify.methods.aggregative import EMQ
from sklearn.linear_model import LogisticRegression
import numpy as np

X_train = np.random.rand(100, 10)
y_train = np.random.randint(0, 2, size=100)
X_test = np.random.rand(50, 10)

quantifier = EMQ(LogisticRegression())
quantifier.fit(X_train, y_train)

class_distribution = quantifier.predict(X_test)  # estimated class prevalences
scores = quantifier.predict_proba(X_test)  # EM-updated posterior probabilities

print("Class distribution:", class_distribution)
print("Scores:", scores)