.. _sphx_protocols:

====================================
Evaluation protocols (APP, NPP, UPP)
====================================

A single test set cannot tell you whether a quantifier is robust to prior
shift, because it has only one prevalence. **Protocols** solve this by carving
many evaluation samples — each with a controlled class distribution — out of one
dataset. ``mlquantify`` ships three:

- :class:`~mlquantify.model_selection.APP` — *Artificial Prevalence Protocol*:
  prevalences laid out on a regular grid (exhaustive, the classic choice).
- :class:`~mlquantify.model_selection.UPP` — *Uniform Prevalence Protocol*:
  prevalences drawn uniformly from the probability simplex (scales to
  multiclass without a grid blow-up).
- :class:`~mlquantify.model_selection.NPP` — *Natural Prevalence Protocol*:
  plain random sampling, preserving the dataset's natural prevalence.
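
The grid blow-up mentioned above can be made concrete with a little counting.
The sketch below is plain NumPy and does not call ``mlquantify``; the helper
``n_grid_points`` is hypothetical, and it assumes a grid that enumerates every
prevalence vector whose entries are multiples of ``1 / (n_prevalences - 1)``
and sum to 1. Drawing from a flat Dirichlet distribution is one standard way to
sample uniformly from the simplex; whether :class:`~mlquantify.model_selection.UPP`
does exactly this internally is an assumption.

.. code-block:: python

    from math import comb

    import numpy as np


    def n_grid_points(n_prevalences, n_classes):
        """Count grid prevalence vectors that sum to 1 (hypothetical helper).

        With ``n_prevalences`` points per axis there are ``n_prevalences - 1``
        equal steps to distribute over ``n_classes`` classes (stars and bars).
        """
        steps = n_prevalences - 1
        return comb(steps + n_classes - 1, n_classes - 1)


    # The grid size grows combinatorially with the number of classes.
    grid_sizes = {k: n_grid_points(21, k) for k in (2, 3, 5, 10)}
    for k, size in grid_sizes.items():
        print(f"{k:>2} classes: {size} grid points")

    # Uniform sampling keeps the budget fixed: 60 vectors regardless of classes.
    rng = np.random.default_rng(0)
    uniform_prevs = rng.dirichlet(np.ones(5), size=60)  # shape (60, 5)

With 21 points per axis, a binary problem needs 21 grid samples, while a
ten-class problem would need about ten million. The uniform draw costs 60
samples in both cases.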

The plot below shows a histogram of the positive-class prevalence of every
sample each protocol generates, so you can see the distribution each one
explores.

.. plot::

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_classification

    from mlquantify.model_selection import APP, NPP, UPP

    # Binary toy dataset whose natural positive-class prevalence is about 0.4.
    X, y = make_classification(
        n_samples=4000, n_features=10, weights=[0.6, 0.4], random_state=0,
    )

    protocols = {
        "APP (grid)": APP(batch_size=200, n_prevalences=21, repeats=3,
                          random_state=0),
        "UPP (uniform simplex)": UPP(batch_size=200, n_prevalences=60,
                                     repeats=1, random_state=0),
        "NPP (natural)": NPP(batch_size=200, n_samples=60, random_state=0),
    }

    fig, axes = plt.subplots(3, 1, figsize=(7, 6), sharex=True)
    for (name, proto), ax, color in zip(
        protocols.items(), axes, ["#264653", "#2a9d8f", "#e76f51"],
    ):
        # The mean label of each generated sample is its positive-class prevalence.
        prevs = [y[idx].mean() for idx in proto.split(X, y)]
        ax.hist(prevs, bins=np.linspace(0, 1, 26), color=color, alpha=0.85)
        ax.set_ylabel("count")
        ax.set_title(f"{name}  —  {len(prevs)} samples", fontsize="medium")
    axes[-1].set_xlabel("Positive-class prevalence of the generated sample")
    fig.tight_layout()

APP covers the range evenly (useful for stress-testing a quantifier at every
prevalence), UPP spreads samples randomly but still across the whole range, and
NPP clusters tightly around the dataset's natural prevalence of about 0.4, which
is realistic but blind to shift. Use APP or UPP to *evaluate robustness*, and
NPP to *estimate deployment error* when the test distribution is expected to
match the data at hand.
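
To illustrate why the two families behave so differently, here is a minimal
sketch of the underlying sampling idea, written in plain NumPy rather than with
the ``mlquantify`` classes. The function ``sample_at_prevalence`` is
hypothetical, and sampling with replacement is an assumption made here so that
extreme prevalences remain feasible; the library's own sampling strategy may
differ.

.. code-block:: python

    import numpy as np


    def sample_at_prevalence(y, prevalence, batch_size, rng):
        """Return indices of a batch with the requested positive-class share."""
        pos_idx = np.flatnonzero(y == 1)
        neg_idx = np.flatnonzero(y == 0)
        n_pos = int(round(prevalence * batch_size))
        n_neg = batch_size - n_pos
        # Draw each class separately, with replacement (an assumption of this
        # sketch), then shuffle so the batch is not ordered by class.
        chosen = np.concatenate([
            rng.choice(pos_idx, size=n_pos, replace=True),
            rng.choice(neg_idx, size=n_neg, replace=True),
        ])
        return rng.permutation(chosen)


    rng = np.random.default_rng(0)
    y = (rng.random(4000) < 0.4).astype(int)  # labels with ~0.4 prevalence

    # Controlled prevalences: one batch per point of a 21-point grid.
    grid = np.linspace(0, 1, 21)
    controlled_prevs = [
        y[sample_at_prevalence(y, p, 200, rng)].mean() for p in grid
    ]

    # Natural prevalences: plain random batches, no control over the classes.
    natural_prevs = [
        y[rng.choice(len(y), size=200, replace=False)].mean() for _ in range(60)
    ]

The controlled batches reproduce the grid values exactly, while the natural
batches only fluctuate around the prevalence of ``y`` itself, which is why they
cannot reveal how a quantifier behaves under shift.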

.. seealso::

   - :func:`~mlquantify.model_selection.apply_protocol` — run a protocol and
     collect true/predicted prevalences in one call.
   - :ref:`sphx_error_by_shift` — turn an APP run into a robustness curve.