.. _counters_module:

.. currentmodule:: mlquantify.adjust_counting

===========================
Counters For Quantification
===========================

A straightforward approach to quantification is to count the number of items
predicted to belong to each class in the unlabeled set. This is the basis of
the **Classify and Count** family of methods.

Classify and Count
==================

The **Classify and Count** method, or :class:`CC`, is the simplest baseline.
It trains a hard classifier :math:`h` on labeled data :math:`L`, applies it to
an unlabeled set :math:`U`, and counts how many samples are assigned to each
predicted class.

**Example**

.. code-block:: python

    from mlquantify.adjust_counting import CC
    from sklearn.linear_model import LogisticRegression
    import numpy as np

    X, y = np.random.randn(100, 5), np.random.randint(0, 2, 100)

    q = CC(learner=LogisticRegression())
    q.fit(X, y)
    q.predict(X)  # -> {0: 0.47, 1: 0.53}

.. note::

    :class:`CC` is fast and simple, but when class proportions in the test set
    differ from those in the training set, its estimates can become biased or
    inaccurate.

Probabilistic Classify and Count
================================

The **Probabilistic Classify and Count** variant, or :class:`PCC`, uses the
*predicted probabilities* from a soft classifier instead of hard labels. This
makes it less sensitive to uncertain predictions near the decision boundary.

[Plot Idea: A plot comparing probabilities per sample and their averaged mean per class]

**Example**

.. code-block:: python

    from mlquantify.adjust_counting import PCC
    from sklearn.linear_model import LogisticRegression
    import numpy as np

    X, y = np.random.randn(100, 5), np.random.randint(0, 2, 100)

    q = PCC(learner=LogisticRegression())
    q.fit(X, y)
    q.predict(X)  # -> {0: 0.45, 1: 0.55}

Both CC and PCC often underestimate or overestimate the true prevalence when
there is distribution shift (also known as *dataset shift*), because their
estimates tend to be pulled toward the class proportions observed in the
training data.
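
To make the difference between the two counting rules concrete, the sketch
below computes both estimates directly with scikit-learn, without going through
mlquantify: CC is taken as the fraction of hard predictions per class, and PCC
as the per-class mean of predicted probabilities. The data and classifier are
arbitrary choices for illustration, not part of the library's API.

.. code-block:: python

    # Minimal sketch of the CC and PCC counting rules using scikit-learn only.
    # Illustrates the underlying idea; this is not mlquantify's implementation.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 5))
    y_train = rng.integers(0, 2, size=200)
    X_test = rng.normal(size=(100, 5))

    clf = LogisticRegression().fit(X_train, y_train)

    # Classify and Count: fraction of hard predictions per class.
    hard = clf.predict(X_test)
    cc_estimate = {c: float(np.mean(hard == c)) for c in clf.classes_}

    # Probabilistic Classify and Count: per-class mean of predicted probabilities.
    proba = clf.predict_proba(X_test)
    pcc_estimate = {c: float(proba[:, i].mean()) for i, c in enumerate(clf.classes_)}

    print(cc_estimate)
    print(pcc_estimate)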
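
The distribution-shift issue mentioned above can be reproduced with a small
experiment: train on a roughly balanced sample, then apply CC to a test set in
which the positive class is deliberately rare. The data generation below is
purely illustrative (two overlapping Gaussian blobs), and the ``sample`` helper
is a hypothetical convenience for this sketch; with an imperfect classifier,
the CC estimate typically lands between the true test prevalence and the
training prevalence.

.. code-block:: python

    # Illustrative check of CC's behaviour under prior shift on synthetic data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(42)

    def sample(n_pos, n_neg):
        """Two overlapping blobs: class 1 centred at +1, class 0 at -1."""
        X_pos = rng.normal(loc=1.0, size=(n_pos, 2))
        X_neg = rng.normal(loc=-1.0, size=(n_neg, 2))
        X = np.vstack([X_pos, X_neg])
        y = np.array([1] * n_pos + [0] * n_neg)
        return X, y

    # Balanced training set (prevalence of class 1 is 0.5).
    X_train, y_train = sample(500, 500)
    # Shifted test set (true prevalence of class 1 is 0.1).
    X_test, y_test = sample(100, 900)

    clf = LogisticRegression().fit(X_train, y_train)

    # Classify and Count estimate of the positive prevalence.
    cc_estimate = float(np.mean(clf.predict(X_test) == 1))
    print("true prevalence:", y_test.mean())  # 0.10
    print("CC estimate:    ", cc_estimate)    # typically above 0.10, pulled toward 0.50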