2.7. Nearest Neighbours#

Nearest-neighbour quantifiers estimate class prevalences by leveraging the local structure of the feature space. They are non-parametric, require no distributional assumptions, and are naturally robust to non-linear decision boundaries.


2.7.1. PWK — Probabilistic Weighted k-Nearest Neighbours#

PWK (Barranquero et al., 2013) wraps a k-NN classifier with a class-imbalance-aware weighting scheme and then applies Classify and Count (CC) on its predictions. Each neighbour’s vote is multiplied by a class-specific weight that corrects for the size difference between classes, so that the minority class is not drowned out by the majority.

The weight for class \(c\) is:

\[w_c(\alpha) = \left(\frac{M}{m_c}\right)^{1/\alpha}\]

where \(M = \min_c m_c\) is the smallest class size and \(\alpha\) is the imbalance-correction exponent.

Special cases:

  • \(\alpha = 1\) → standard PWK (weights proportional to inverse class size).

  • \(\alpha \to \infty\) → all weights equal to 1 (standard k-NN).

Why it exists: A standard k-NN classifier biases predictions towards the majority class. PWK’s weighting neutralises this bias, producing more accurate CC estimates under class imbalance. Barranquero et al. (2013) showed PWK outperforms standard CC + k-NN by a wide margin on imbalanced datasets.

2.7.1.1. Parameters#

Parameter

Default

Explanation

alpha

1

Imbalance-correction exponent. Higher values reduce the penalty on larger classes:

  • alpha=1 — standard PWK weighting (most aggressive correction).

  • alpha=2 — gentler correction.

  • Very large alpha (e.g. 100) approaches standard k-NN.

Tune by cross-validation if you are unsure; alpha=1 is a good starting point.

n_neighbors

10

Number of neighbours \(k\). Larger \(k\) gives smoother (lower variance) prevalence estimates. Use larger values on bigger datasets. Common choices: 5, 10, 20.

algorithm

'auto'

k-NN search algorithm. 'auto' selects the fastest available (ball_tree/kd_tree for low-dim data, brute-force for high-dim).

metric

'euclidean'

Distance metric for neighbour search. Use 'euclidean' for continuous normalised features; 'cosine' for sparse text/embedding vectors; 'manhattan' for count data.

leaf_size

30

Leaf size for tree-based algorithms. Affects speed, not accuracy.

p

2

Minkowski distance parameter. p=2 → Euclidean, p=1 → Manhattan.

metric_params

None

Extra keyword arguments for custom metric functions.

n_jobs

None

Parallel jobs for neighbour search. -1 uses all cores.

2.7.1.2. Examples#

Basic usage:

from mlquantify.neighbors import PWK
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.8, 0.2],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

q = PWK(alpha=1, n_neighbors=10)
q.fit(X_train, y_train)
print(q.predict(X_test))
# {0: 0.80, 1: 0.20}

Tuning alpha and k:

from mlquantify.model_selection import GridSearchQ, APP
from mlquantify.metrics import MAE
from mlquantify.neighbors import PWK

protocol = APP(batch_size=100, n_prevalences=21, repeats=5)
gs = GridSearchQ(
    quantifier=PWK(),
    param_grid={
        'alpha': [1, 2, 5],
        'n_neighbors': [5, 10, 20],
    },
    protocol=protocol,
    error=MAE,
)
gs.fit(X_train, y_train)
print(gs.best_params_)

Getting per-instance classifications:

q = PWK(alpha=1, n_neighbors=10)
q.fit(X_train, y_train)
labels = q.classify(X_test)   # hard labels from the weighted k-NN
print(labels[:10])

2.7.1.3. When to Use PWK#

  • When you want a simple, non-parametric quantifier without tuning a full probabilistic classifier.

  • When feature space distances are meaningful (normalised continuous features).

  • For imbalanced binary problems where a standard classifier would be biased.

Limitation: PWK applies CC on top of the k-NN classifier; it inherits CC’s bias under distributional shift. It does not apply any adjustment like ACC or EMQ. For strong shift correction, use EMQ or DyS instead.

2.7.2. References#

References
  • Barranquero, J., Díez, J., & del Coz, J. J. (2013). On the study of nearest neighbor algorithms for prevalence estimation in binary problems. Pattern Recognition, 46(2), 472–482.

See also

Counting-Based Quantifiers for CC (which PWK builds on top of). Likelihood-Based Quantification for EMQ (stronger shift correction).