.. _counters_module:

.. currentmodule:: mlquantify.counting

===========================
Counting-Based Quantifiers
===========================

Counting-based quantifiers are the simplest family of aggregative methods:
train a classifier, apply it to the test set, and count the fraction of
instances assigned to each class. They are cheap, easy to understand, and
serve as essential baselines.

See :ref:`quantification_foundations` for the theoretical background on *why*
counting alone is biased and *when* you need a stronger correction.

.. contents:: Contents
   :local:
   :depth: 2

----

CC — Classify and Count
========================

:class:`CC` is the simplest quantification baseline. It trains a hard
classifier :math:`h` on labelled data :math:`L`, applies it to an unlabelled
set :math:`U`, and counts the proportion of predictions for each class:

.. math::

   \hat{p}^{CC}(c) = \frac{|\{x \in U : h(x) = c\}|}{|U|}

**Why it exists:** CC is the *reference baseline* for every quantification
study. Despite being biased under distributional shift (Forman, 2005), it is
fast, interpretable, and competitive when training and test prevalences are
close. Always include it as a point of comparison.

.. figure:: ../images/cc_bias.png
   :align: center
   :width: 75%
   :alt: CC and PCC prevalence estimation bias

   *A classifier trained on 50/50 balanced data is evaluated on test sets with
   varying true prevalences. CC (red) consistently overestimates at low prevalences
   and underestimates at high ones. PCC (orange) is less biased but still distorted.
   The dashed line is the ideal unbiased estimator.*

.. admonition:: When CC fails

   Suppose a classifier trained on balanced data (50 % positive) is 90 %
   accurate on each class, but the test set has only 5 % positives. CC
   estimates :math:`\approx 14\%` positives, nearly 3× the truth, because
   the false positives from the 95 % negatives dominate the count. The bias
   does not vanish with more data; it vanishes only when the distributions
   match.
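
The 14 % figure follows from a short calculation. As a sketch, assume the
classifier is right 90 % of the time on both classes, so the true-positive
rate is 0.9 and the false-positive rate is 0.1. The helper name
``expected_cc_estimate`` is illustrative and not part of mlquantify:

.. code-block:: python

   def expected_cc_estimate(true_prev, tpr, fpr):
       """Expected fraction of positive predictions under CC."""
       # True positives come from the positive share of the test set,
       # false positives from the negative share.
       return tpr * true_prev + fpr * (1.0 - true_prev)

   estimate = expected_cc_estimate(true_prev=0.05, tpr=0.9, fpr=0.1)
   print(f"{estimate:.2f}")  # 0.14

Only 0.045 of that 0.14 comes from true positives; the remaining 0.095 is
false positives from the negative majority. Neither term depends on the size
of the test set, which is why collecting more data does not remove the bias.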

Parameters
----------

.. list-table::
   :widths: 20 15 65
   :header-rows: 1

   * - Parameter
     - Default
     - Explanation
   * - ``estimator``
     - ``None``
     - Any scikit-learn-compatible classifier with ``fit`` and ``predict``
       methods. If ``None``, skip fitting and call :meth:`aggregate` directly
       with pre-computed hard labels.
   * - ``threshold``
     - ``0.5``
     - Decision threshold applied to soft scores when the estimator exposes
       ``predict_proba``. Values above the threshold are labelled positive.
       Rarely needs tuning here — use :class:`TAC`, :class:`T50`, or
       :class:`TMAX` if threshold selection matters for your problem.
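
The thresholding step described for ``threshold`` can be pictured with plain
NumPy. This is only a sketch of the idea, assuming a binary problem and an
array of positive-class scores; the variable names are illustrative and the
code does not reproduce CC's internals:

.. code-block:: python

   import numpy as np

   # Hypothetical positive-class scores for six test instances.
   scores = np.array([0.10, 0.45, 0.55, 0.80, 0.30, 0.95])
   threshold = 0.5

   # Scores strictly above the threshold become positive predictions.
   hard_labels = (scores > threshold).astype(int)

   # Classify and count: share of each label among the predictions.
   prevalence = {c: float(np.mean(hard_labels == c)) for c in (0, 1)}
   print(hard_labels.tolist())  # [0, 0, 1, 1, 0, 1]
   print(prevalence)            # {0: 0.5, 1: 0.5}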

Examples
--------

Basic usage:

.. code-block:: python

   from mlquantify.counting import CC
   from sklearn.linear_model import LogisticRegression
   from sklearn.datasets import make_classification
   from sklearn.model_selection import train_test_split

   X, y = make_classification(n_samples=1000, weights=[0.8, 0.2],
                              random_state=42)
   X_train, X_test, y_train, y_test = train_test_split(
       X, y, test_size=0.3, random_state=42)

   q = CC(LogisticRegression())
   q.fit(X_train, y_train)
   print(q.predict(X_test))
   # {0: 0.82, 1: 0.18}

Using :meth:`aggregate` with pre-computed hard labels:

.. code-block:: python

   import numpy as np
   from mlquantify.counting import CC

   q = CC()
   hard_labels = np.array([0, 0, 1, 0, 1, 1, 0, 0, 0, 1])
   print(q.aggregate(hard_labels))
   # {0: 0.6, 1: 0.4}

.. warning::

   CC is consistently biased when the test class distribution differs from
   training. For any real deployment scenario with distribution shift, use
   :class:`ACC`, :class:`~mlquantify.likelihood.EMQ`, or a
   distribution-matching method instead.
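
To see why an adjusted method helps, the sketch below inverts the CC bias for
a binary problem with the standard adjusted-count formula,
:math:`(\hat{p}^{CC} - \mathrm{fpr}) / (\mathrm{tpr} - \mathrm{fpr})`. The
rates are assumed known here (in practice they have to be estimated on
held-out data), and ``adjusted_count`` is an illustrative helper rather than
the :class:`ACC` implementation:

.. code-block:: python

   def adjusted_count(cc_estimate, tpr, fpr):
       """Solve cc = tpr*p + fpr*(1-p) for p and clip the result to [0, 1]."""
       p = (cc_estimate - fpr) / (tpr - fpr)
       return min(1.0, max(0.0, p))

   # CC reports 14 % positives with tpr = 0.9 and fpr = 0.1.
   print(f"{adjusted_count(0.14, tpr=0.9, fpr=0.1):.2f}")  # 0.05

The clipping matters because noisy rate estimates can push the raw value
outside the valid range.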

----

PCC — Probabilistic Classify and Count
========================================

:class:`PCC` replaces hard decisions with **posterior probabilities** and
estimates prevalence as their average over the test set:

.. math::

   \hat{p}^{PCC}(c) = \frac{1}{|U|} \sum_{x \in U} P(Y = c \mid x)

where :math:`P(Y=c \mid x)` is the soft output of a probabilistic classifier.

**Why it exists:** PCC (Bella et al., 2010) smooths out the hard-thresholding
noise in CC. By averaging probabilities instead of counting hard predictions,
it reduces variance and often reduces bias too, especially when the classifier
is well calibrated. However, Forman (2005) showed that uncalibrated posteriors
under prior-probability shift are still biased, so PCC is not a complete
solution. It is the cheapest upgrade from CC: the only extra requirement is an
estimator that exposes ``predict_proba``.
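
The difference between counting and averaging is easy to see on a handful of
posteriors. The sketch below uses plain NumPy on a made-up probability matrix;
it illustrates the two formulas and is not the library's code path:

.. code-block:: python

   import numpy as np

   # Hypothetical posteriors, shape (n_samples, n_classes).
   proba = np.array([[0.9, 0.1],
                     [0.3, 0.7],
                     [0.6, 0.4]])

   # CC: pick the most probable class for each instance, then count.
   hard = proba.argmax(axis=1)
   cc = np.bincount(hard, minlength=proba.shape[1]) / len(hard)

   # PCC: average the posteriors column by column.
   pcc = proba.mean(axis=0)

   print(cc.round(3).tolist())   # [0.667, 0.333]
   print(pcc.round(3).tolist())  # [0.6, 0.4]

The borderline third instance (0.6 versus 0.4) counts as a full vote for
class 0 under CC but contributes only 0.6 under PCC.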

Parameters
----------

.. list-table::
   :widths: 20 15 65
   :header-rows: 1

   * - Parameter
     - Default
     - Explanation
   * - ``estimator``
     - ``None``
     - A probabilistic classifier with ``fit`` and ``predict_proba`` methods.
       Required for soft probability outputs. If ``None``, call
       :meth:`aggregate` directly with a pre-computed probability matrix of
       shape ``(n_samples, n_classes)``.

Examples
--------

.. code-block:: python

   from mlquantify.counting import PCC
   from sklearn.ensemble import RandomForestClassifier

   # X_train, y_train and X_test are the splits created in the CC example.
   q = PCC(RandomForestClassifier(n_estimators=100, random_state=42))
   q.fit(X_train, y_train)
   print(q.predict(X_test))
   # {0: 0.80, 1: 0.20}

Using :meth:`aggregate` with pre-computed posteriors:

.. code-block:: python

   import numpy as np
   from mlquantify.counting import PCC

   # Shape: (n_samples, n_classes)
   proba = np.array([[0.9, 0.1],
                     [0.3, 0.7],
                     [0.6, 0.4]])
   q = PCC()
   print(q.aggregate(proba))
   # {0: 0.6, 1: 0.4}

----

GACC — Generalised Adjusted Classify and Count
================================================

:class:`GACC` extends the binary ACC correction to **native multiclass**
problems. During training, it builds a confusion matrix :math:`M` via
cross-validation (where :math:`M_{ij}` is the fraction of class-:math:`i`
samples predicted as class :math:`j`). At prediction time it solves the
constrained linear system:

.. math::

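   \hat{p} = \arg\min_{p \in \Delta^{k-1}} \| M^{\top} p - \hat{c} \|

where :math:`\hat{c}` is the vector of CC proportions on the test set and
:math:`\Delta^{k-1}` is the probability simplex over the :math:`k` classes.
The transpose appears because the rows of :math:`M` index the true class.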

**Why it exists:** Binary ACC corrects the count with a single true-positive
rate and false-positive rate, which cannot describe how :math:`k>2` classes
are confused with one another. GACC provides the multiclass generalisation:
it uses the full :math:`k \times k` confusion matrix and a constrained
optimisation, as introduced by Firat (2016).
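
The constrained solve can be sketched with SciPy. With :math:`M_{ij}` defined
as above (rows index the true class), the CC vector expected from a prevalence
vector :math:`p` is :math:`M^{\top} p`, so the sketch minimises the squared
distance between that vector and the observed :math:`\hat{c}` over the
simplex. The matrix values and the helper name ``solve_prevalence`` are made
up for illustration; this shows the technique, not necessarily how
:class:`GACC` implements it:

.. code-block:: python

   import numpy as np
   from scipy.optimize import minimize

   def solve_prevalence(M, c_hat):
       """Least-squares prevalence estimate constrained to the simplex."""
       k = M.shape[0]
       result = minimize(
           lambda p: np.sum((M.T @ p - c_hat) ** 2),
           x0=np.full(k, 1.0 / k),        # start from the uniform vector
           method="SLSQP",
           bounds=[(0.0, 1.0)] * k,        # every prevalence in [0, 1]
           constraints={"type": "eq", "fun": lambda p: p.sum() - 1.0},
           options={"ftol": 1e-12},
       )
       return result.x

   # Hypothetical confusion matrix: row i = true class, column j = predicted.
   M = np.array([[0.8, 0.1, 0.1],
                 [0.2, 0.7, 0.1],
                 [0.1, 0.2, 0.7]])
   true_p = np.array([0.6, 0.3, 0.1])
   c_hat = M.T @ true_p                    # CC proportions implied by true_p

   print(c_hat.round(2).tolist())          # [0.55, 0.29, 0.16]
   print(solve_prevalence(M, c_hat).round(2).tolist())  # close to [0.6, 0.3, 0.1]

CC alone would report 0.55, 0.29 and 0.16; the solve recovers the prevalences
that generated those counts because this example uses the exact confusion
matrix. With an estimated matrix the recovery is only approximate.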

Parameters
----------

.. list-table::
   :widths: 20 15 65
   :header-rows: 1

   * - Parameter
     - Default
     - Explanation
   * - ``estimator``
     - ``None``
     - Hard-prediction classifier (``fit`` + ``predict``). Does not need to be
       probabilistic.
   * - ``loss``
     - ``'ls'``
     - Loss used to solve the linear system. ``'ls'`` (least squares) is the
       standard choice. ``'l1'`` adds robustness against outlier confusion-matrix
       entries at the cost of a slower solve.
   * - ``solver``
     - ``'slsqp'``
     - Optimisation algorithm. ``'slsqp'`` handles the simplex constraint
       (:math:`\hat{p} \ge 0`, :math:`\sum \hat{p}=1`) natively and is the
       recommended choice.
   * - ``cv``
     - ``None`` (→ 5)
     - Cross-validation folds for building the confusion matrix. More folds give
       a more accurate estimate of the confusion matrix at the cost of extra
       training time. 5 is a good default; use 10 on larger datasets.
   * - ``stratified``
     - ``True``
     - Stratify folds to ensure every class appears in every fold — essential
       when classes are imbalanced. Leave ``True`` unless you have a specific
       reason not to.
   * - ``shuffle``
     - ``False``
     - Whether to shuffle data before splitting. Set ``True`` if the dataset has
       temporal ordering that could make consecutive samples very similar.
   * - ``random_state``
     - ``None``
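     - Seed for reproducibility of the cross-validation split. It can only
       matter when ``shuffle`` is ``True``, since an unshuffled split is
       already deterministic.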
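
The cross-validation parameters map naturally onto scikit-learn's splitters.
The sketch below shows one way to build a row-normalised confusion matrix from
out-of-fold predictions. It is an assumption about the general technique, not
a copy of what :class:`GACC` does internally:

.. code-block:: python

   from sklearn.datasets import make_classification
   from sklearn.linear_model import LogisticRegression
   from sklearn.metrics import confusion_matrix
   from sklearn.model_selection import StratifiedKFold, cross_val_predict

   X, y = make_classification(n_samples=600, n_classes=3,
                              n_informative=5, n_redundant=0,
                              random_state=42)

   # Roughly what cv=5, stratified=True, shuffle=True, random_state=0 express.
   folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

   # Each sample is predicted by a model that did not see it during training.
   y_oof = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=folds)

   # normalize="true" divides each row by the class size, so M[i, j] is the
   # fraction of class-i samples predicted as class j.
   M = confusion_matrix(y, y_oof, normalize="true")
   print(M.shape)        # (3, 3)
   print(M.sum(axis=1))  # every row sums to 1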

Examples
--------

Three-class quantification:

.. code-block:: python

   from mlquantify.counting import GACC
   from sklearn.svm import SVC
   from sklearn.datasets import make_classification

   X, y = make_classification(n_samples=600, n_classes=3,
                              n_informative=5, n_redundant=0,
                              random_state=42)
   X_train, X_test = X[:450], X[450:]
   y_train, y_test = y[:450], y[450:]

   q = GACC(SVC(), cv=5, stratified=True)
   q.fit(X_train, y_train)
   print(q.predict(X_test))
   # {0: 0.33, 1: 0.34, 2: 0.33}

.. note::

   GACC needs at least ``cv`` training samples per class. If a class is very
   rare, reduce ``cv`` or use stratified splits to prevent empty folds.

----

GPACC — Generalised Probabilistic Adjusted Classify and Count
==============================================================

:class:`GPACC` is the soft analogue of :class:`GACC`. Instead of a hard
confusion matrix, it builds a **soft confusion matrix** from posterior
probabilities:

.. math::

   M_{ij} = \frac{1}{|L_i|} \sum_{x \in L_i} P(Y = c_j \mid x)

where :math:`L_i` is the set of training examples with true class :math:`c_i`.
The prevalence is then estimated by solving the same constrained system as GACC.
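
The soft confusion matrix is a per-class average of posteriors. A minimal
NumPy sketch with made-up numbers (``soft_confusion_matrix`` is an
illustrative helper, not a library function):

.. code-block:: python

   import numpy as np

   def soft_confusion_matrix(proba, y_true, n_classes):
       """Row i is the mean posterior vector over samples of true class i."""
       return np.vstack([proba[y_true == i].mean(axis=0)
                         for i in range(n_classes)])

   # Hypothetical posteriors for four labelled instances.
   proba = np.array([[0.8, 0.2],
                     [0.6, 0.4],
                     [0.3, 0.7],
                     [0.1, 0.9]])
   y_true = np.array([0, 0, 1, 1])

   M = soft_confusion_matrix(proba, y_true, n_classes=2)
   print(M.round(2).tolist())  # [[0.7, 0.3], [0.2, 0.8]]

For an unbiased matrix the posteriors should be out-of-fold predictions, so
that no instance is scored by a model that was trained on it.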

**Why it exists:** Soft matrices preserve more information than hard ones (no
thresholding artefacts). GPACC is usually more accurate than GACC when the
classifier is well-calibrated and typically outperforms plain PCC. It is the
recommended native-multiclass counting method.

Parameters
----------

Same as :class:`GACC`. The ``estimator`` must support ``predict_proba``.

Examples
--------

.. code-block:: python

   from mlquantify.counting import GPACC
   from sklearn.linear_model import LogisticRegression
   from sklearn.datasets import make_classification

   X, y = make_classification(n_samples=600, n_classes=4,
                              n_informative=6, n_redundant=0,
                              random_state=42)
   X_train, X_test = X[:450], X[450:]
   y_train, y_test = y[:450], y[450:]

   q = GPACC(LogisticRegression(), cv=5)
   q.fit(X_train, y_train)
   print(q.predict(X_test))
   # {0: 0.25, 1: 0.26, 2: 0.24, 3: 0.25}

----

Choosing Among CC, PCC, GACC, and GPACC
==========================================

.. list-table::
   :widths: 18 15 18 20 14
   :header-rows: 1

   * - Method
     - Shift correction
     - Native multiclass
     - Needs ``predict_proba``
     - Extra fit cost
   * - CC
     - ✗
     - ✓
     - ✗
     - None
   * - PCC
     - ✗
     - ✓
     - ✓
     - None
   * - GACC
     - ✓ (linear)
     - ✓
     - ✗
     - CV folds
   * - GPACC
     - ✓ (linear)
     - ✓
     - ✓
     - CV folds

**Practical recommendation:**

- Use **CC** only as a sanity-check baseline.
- Use **GPACC** as your primary counting-family method: the soft matrix
  and simplex constraint make it noticeably better than CC/PCC under shift.
- If probabilities are unavailable, fall back to **GACC**.
- For large distributional shifts, graduate to :class:`~mlquantify.likelihood.EMQ`
  or a distribution-matching method from :mod:`mlquantify.matching`.

References
==========

.. dropdown:: References

   - Forman, G. (2005). Counting Positives Accurately Despite Inaccurate
     Classification. *ECML*, 564–575.
   - Bella, A., Ferri, C., Hernández-Orallo, J., & Ramírez-Quintana, M. J.
     (2010). Quantification via Probability Estimators. *ICDM*, 737–742.
   - Firat, A. (2016). Unified Framework for Quantification.
     *arXiv:1606.00868*.

.. seealso::

   :ref:`counting` for threshold-adjustment methods (:class:`ACC`,
   :class:`TAC`, :class:`TX`, …) which offer stronger binary-specific correction.