1.1. Aggregative Quantification#
Aggregative quantifiers are a class of quantification methods that aggregate the results of an intermediate task, such as classification; that is, the quantifier uses the predicted values (labels or scores) of a classifier to estimate the class distribution of the test set. These quantifiers can be divided into two main groups, the mixture models and the threshold methods, but there are also methods that do not fit into either category, such as:
| quantifier | class | reference |
|---|---|---|
| Classify and Count |  |  |
| Expectation Maximisation for Quantification |  |  |
| Probabilistic Classify and Count |  |  |
| Friedman Method |  |  |
| Generalized Adjusted Count |  |  |
| Generalized Probabilistic Adjusted Count |  |  |
| Nearest-Neighbor based Quantification |  |  |
Note that all of the methods listed above are multiclass quantifiers, whereas the mixture models and threshold methods are binary quantifiers.
1.1.1. Classify and Count#
The most basic quantification method is Classify and Count (CC), a simple approach that classifies each instance in the test set and counts the number of instances predicted for each class. The CC method rests on the assumption that the class distribution in the test set is similar to that in the training set. Because of its simplicity, CC is commonly used as a baseline for comparison with other quantification methods.
The CC method is implemented in the `CC` class.
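As a minimal usage sketch, assuming `CC` is importable from `mlquantify.methods.aggregative` in the same way as `EMQ` shown later in this section:

```python
from mlquantify.methods.aggregative import CC  # assumed import path, mirroring EMQ
from sklearn.linear_model import LogisticRegression
import numpy as np

# Synthetic binary data
X_train = np.random.rand(100, 10)
y_train = np.random.randint(0, 2, size=100)
X_test = np.random.rand(50, 10)

quantifier = CC(LogisticRegression())
quantifier.fit(X_train, y_train)

# CC classifies the test set and reports the fraction of
# instances predicted for each class
class_distribution = quantifier.predict(X_test)
print("Class distribution:", class_distribution)
```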
1.1.2. Mixture Models#
Mixture models are binary quantification methods that assume the cumulative distribution of an unknown dataset (the test set) is a mixture of two or more distributions derived from the training set. This concept was first introduced by Forman (2005, 2008).
The base structure of mixture models uses the scores of the training set, generated via cross-validation, and combines them to approximate the distribution of the test set scores. Quantification is then performed by estimating the parameter of the mixture, the class prevalence, so that the combined training score distributions best match the test score distribution.
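As a minimal sketch of this idea, not the library's implementation, the code below approximates the test score histogram as a convex combination of the positive and negative training score histograms and selects the mixing weight (the positive prevalence) that minimizes the Hellinger distance, the criterion used by HDy; all function names here are illustrative:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

def mixture_prevalence(pos_scores, neg_scores, test_scores, bins=10):
    """Illustrative HDy-style grid search for the positive prevalence.

    Models the test score histogram as
    alpha * P(score | positive) + (1 - alpha) * P(score | negative)
    and returns the alpha giving the smallest Hellinger distance.
    """
    edges = np.linspace(0, 1, bins + 1)
    pos_hist = np.histogram(pos_scores, bins=edges)[0] / len(pos_scores)
    neg_hist = np.histogram(neg_scores, bins=edges)[0] / len(neg_scores)
    test_hist = np.histogram(test_scores, bins=edges)[0] / len(test_scores)

    alphas = np.linspace(0, 1, 101)
    distances = [hellinger(a * pos_hist + (1 - a) * neg_hist, test_hist)
                 for a in alphas]
    return alphas[int(np.argmin(distances))]
```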
The library implements the following mixture models:
| quantifier | class | reference |
|---|---|---|
| Distribution y-Similarity |  |  |
| Synthetic Distribution y-Similarity |  |  |
| Hellinger Distance Minimization |  |  |
| Sample Mean Matching |  |  |
| Sample Ordinal Distance |  |  |
Some algorithms, such as DyS, DySsyn, and HDy, are based on distances between the mixture of the training scores and the test scores. HDy, for example, uses the Hellinger distance to measure the difference between the two distributions. All the distances can be accessed through the `mlquantify.utils.method` module, which implements four different distance functions.
These methods also have the `best_distance` method, which allows you to obtain the best distance computed by the method. Below is an example of how to use this approach:
```python
from mlquantify.methods import DyS
from sklearn.linear_model import LogisticRegression
import numpy as np

# Synthetic binary data
X_train = np.random.rand(100, 10)
y_train = np.random.randint(0, 2, size=100)
X_test = np.random.rand(50, 10)

quantifier = DyS(LogisticRegression())
quantifier.fit(X_train, y_train)

# Best distance found between the mixture of training score
# distributions and the test score distribution
distance = quantifier.best_distance(X_test)
print(distance)
```
1.1.3. Threshold Methods#
The threshold methods are also binary quantifiers (multiclass versions have not been implemented yet). Proposed by Forman (2005, 2008), these algorithms work by adjusting the outputs of a classifier to obtain the class distribution of a test set. Most methods use a table of thresholds (e.g., 0.0, 0.1, 0.2, …, 1.0) along with the TPR (True Positive Rate) and FPR (False Positive Rate) at each threshold to estimate the class distribution of the test set; each quantifier uses these values in a different way, as the sketch below illustrates.
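As an illustration of the common adjustment step, a hypothetical helper rather than the library's internal code, the classic Adjusted Count correction rescales the raw Classify and Count estimate using the TPR and FPR measured at the chosen threshold:

```python
import numpy as np

def adjusted_count(cc_estimate, tpr, fpr):
    """Adjusted Count: correct the raw positive proportion predicted
    by the classifier using its estimated TPR and FPR."""
    if tpr == fpr:  # correction undefined; fall back to the raw estimate
        return cc_estimate
    prevalence = (cc_estimate - fpr) / (tpr - fpr)
    return float(np.clip(prevalence, 0.0, 1.0))  # keep it a valid proportion

# Example: 40% of the test set predicted positive, TPR = 0.8, FPR = 0.1
print(adjusted_count(0.40, tpr=0.8, fpr=0.1))  # ~0.43
```

Methods such as MAX, T50, X, and Median Sweep differ mainly in how they select, or aggregate over, the threshold at which TPR and FPR are measured.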
The library implements the following threshold methods:
| quantifier | class | reference |
|---|---|---|
| Adjusted Classify and Count or Adjusted Count |  |  |
| Threshold MAX |  |  |
| Median Sweep |  |  |
| Median Sweep 2 |  |  |
| Probabilistic Adjusted Classify and Count |  |  |
| Threshold 50 |  |  |
| Threshold X |  |  |
You can compute the table of TPR and FPR values for each threshold using the `adjust_threshold` function. This function takes as input the true labels and the predicted probabilities of the positive class, generated via cross-validation, and returns the thresholds with their corresponding TPR and FPR values.
```python
from mlquantify.utils.method import adjust_threshold, get_scores
from sklearn.linear_model import LogisticRegression
import pandas as pd
import numpy as np

X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)  # Random binary labels {0, 1}
classes = np.unique(y)

model = LogisticRegression()  # Example model, replace with your own

# Generate out-of-fold scores for the training data via cross-validation
y_labels, probabilities = get_scores(X=X, y=y, learner=model, folds=10, learner_fitted=False)
probabilities = probabilities[:, 1]  # Probabilities for the positive class

thresholds, tprs, fprs = adjust_threshold(y=y_labels, probabilities=probabilities, classes=classes)

table = pd.DataFrame({
    "Threshold": thresholds,
    "TPR": tprs,
    "FPR": fprs
})
print(table)
```
Alternatively, use the `compute_table` function together with `compute_tpr` and `compute_fpr` to get the TPR and FPR values manually:
```python
from mlquantify.utils.method import compute_table, compute_tpr, compute_fpr
from sklearn.linear_model import LogisticRegression
import numpy as np

X = np.random.rand(200, 10)
y = np.random.randint(0, 2, size=200)  # Random binary labels {0, 1}
classes = np.unique(y)

model = LogisticRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Confusion-matrix counts for the positive class
TP, FP, FN, TN = compute_table(y, y_pred, classes)

tpr = compute_tpr(TP, FN)  # TP / (TP + FN)
fpr = compute_fpr(FP, TN)  # FP / (FP + TN)
print("True Positive Rate (TPR):", tpr)
print("False Positive Rate (FPR):", fpr)
```
1.1.4. Expectation Maximisation for Quantification#
Expectation Maximisation (EM) is an iterative algorithm used to find maximum likelihood estimates of parameters (in our case, the class distribution) for models that depend on unobserved latent variables. The algorithm works by incrementally updating the posterior probabilities using the class prevalence values computed in the previous iteration, and updating the class prevalence values using the posterior probabilities computed in the previous iteration, in a mutually recursive fashion.
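The sketch below illustrates this loop under simplifying assumptions; it follows the EM procedure commonly attributed to Saerens et al. (2002) and is not the `EMQ` implementation itself:

```python
import numpy as np

def em_quantify(train_prev, posteriors, n_iter=100, tol=1e-6):
    """Illustrative EM prevalence estimation.

    train_prev: class prevalences in the training set, shape (n_classes,)
    posteriors: classifier probabilities on the test set, shape (n_samples, n_classes)
    """
    prev = train_prev.copy()
    for _ in range(n_iter):
        # E-step: reweight each posterior by the ratio between the current
        # prevalence estimate and the training prevalence, then renormalize
        weighted = posteriors * (prev / train_prev)
        weighted /= weighted.sum(axis=1, keepdims=True)
        # M-step: the new prevalence estimate is the mean of the
        # updated posteriors over the test set
        new_prev = weighted.mean(axis=0)
        if np.abs(new_prev - prev).max() < tol:
            break
        prev = new_prev
    return prev
```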
The method is available in the `EMQ` class and can be used either as a probabilistic classifier, via the `predict_proba` method, or as a quantifier, as in the example below:
```python
from mlquantify.methods.aggregative import EMQ
from sklearn.linear_model import LogisticRegression
import numpy as np

# Synthetic binary data
X_train = np.random.rand(100, 10)
y_train = np.random.randint(0, 2, size=100)
X_test = np.random.rand(50, 10)

quantifier = EMQ(LogisticRegression())
quantifier.fit(X_train, y_train)

class_distribution = quantifier.predict(X_test)  # estimated class prevalences
scores = quantifier.predict_proba(X_test)        # adjusted posterior probabilities
print("Class distribution:", class_distribution)
print("Scores:", scores)
```