.. _nearest_neighbors: .. currentmodule:: mlquantify.neighbors ================== Nearest Neighbours ================== Nearest-neighbour quantifiers estimate class prevalences by leveraging the local structure of the feature space. They are non-parametric, require no distributional assumptions, and are naturally robust to non-linear decision boundaries. .. contents:: Contents :local: :depth: 2 ---- PWK — Probabilistic Weighted k-Nearest Neighbours =================================================== :class:`PWK` (Barranquero et al., 2013) wraps a k-NN classifier with a **class-imbalance-aware weighting scheme** and then applies Classify and Count (CC) on its predictions. Each neighbour's vote is multiplied by a class-specific weight that corrects for the size difference between classes, so that the minority class is not drowned out by the majority. The weight for class :math:`c` is: .. math:: w_c(\alpha) = \left(\frac{M}{m_c}\right)^{1/\alpha} where :math:`M = \min_c m_c` is the smallest class size and :math:`\alpha` is the imbalance-correction exponent. Special cases: - :math:`\alpha = 1` → standard PWK (weights proportional to inverse class size). - :math:`\alpha \to \infty` → all weights equal to 1 (standard k-NN). **Why it exists:** A standard k-NN classifier biases predictions towards the majority class. PWK's weighting neutralises this bias, producing more accurate CC estimates under class imbalance. Barranquero et al. (2013) showed PWK outperforms standard CC + k-NN by a wide margin on imbalanced datasets. Parameters ---------- .. list-table:: :widths: 22 15 63 :header-rows: 1 * - Parameter - Default - Explanation * - ``alpha`` - ``1`` - Imbalance-correction exponent. Higher values reduce the penalty on larger classes: - ``alpha=1`` — standard PWK weighting (most aggressive correction). - ``alpha=2`` — gentler correction. - Very large ``alpha`` (e.g. 100) approaches standard k-NN. Tune by cross-validation if you are unsure; ``alpha=1`` is a good starting point. * - ``n_neighbors`` - ``10`` - Number of neighbours :math:`k`. Larger :math:`k` gives smoother (lower variance) prevalence estimates. Use larger values on bigger datasets. Common choices: 5, 10, 20. * - ``algorithm`` - ``'auto'`` - k-NN search algorithm. ``'auto'`` selects the fastest available (ball_tree/kd_tree for low-dim data, brute-force for high-dim). * - ``metric`` - ``'euclidean'`` - Distance metric for neighbour search. Use ``'euclidean'`` for continuous normalised features; ``'cosine'`` for sparse text/embedding vectors; ``'manhattan'`` for count data. * - ``leaf_size`` - ``30`` - Leaf size for tree-based algorithms. Affects speed, not accuracy. * - ``p`` - ``2`` - Minkowski distance parameter. ``p=2`` → Euclidean, ``p=1`` → Manhattan. * - ``metric_params`` - ``None`` - Extra keyword arguments for custom metric functions. * - ``n_jobs`` - ``None`` - Parallel jobs for neighbour search. ``-1`` uses all cores. Examples -------- Basic usage: .. code-block:: python from mlquantify.neighbors import PWK from sklearn.datasets import make_classification from sklearn.model_selection import train_test_split X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42) X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.3, random_state=42) q = PWK(alpha=1, n_neighbors=10) q.fit(X_train, y_train) print(q.predict(X_test)) # {0: 0.80, 1: 0.20} Tuning alpha and k: .. code-block:: python from mlquantify.model_selection import GridSearchQ, APP from mlquantify.metrics import MAE from mlquantify.neighbors import PWK protocol = APP(batch_size=100, n_prevalences=21, repeats=5) gs = GridSearchQ( quantifier=PWK(), param_grid={ 'alpha': [1, 2, 5], 'n_neighbors': [5, 10, 20], }, protocol=protocol, error=MAE, ) gs.fit(X_train, y_train) print(gs.best_params_) Getting per-instance classifications: .. code-block:: python q = PWK(alpha=1, n_neighbors=10) q.fit(X_train, y_train) labels = q.classify(X_test) # hard labels from the weighted k-NN print(labels[:10]) When to Use PWK --------------- - When you want a simple, non-parametric quantifier without tuning a full probabilistic classifier. - When feature space distances are meaningful (normalised continuous features). - For imbalanced binary problems where a standard classifier would be biased. **Limitation:** PWK applies CC on top of the k-NN classifier; it inherits CC's bias under distributional shift. It does not apply any adjustment like ACC or EMQ. For strong shift correction, use EMQ or DyS instead. References ========== .. dropdown:: References - Barranquero, J., Díez, J., & del Coz, J. J. (2013). On the study of nearest neighbor algorithms for prevalence estimation in binary problems. *Pattern Recognition*, 46(2), 472–482. .. seealso:: :ref:`counters_module` for CC (which PWK builds on top of). :ref:`likelihood` for EMQ (stronger shift correction).