.. _sphx_synthetic_prevalence:

==============================================
Controlling prevalence variability across bags
==============================================

The ``prevalence`` argument (and, for the ``"dirichlet"`` strategy,
``concentration``) decides *how* each bag's class balance is drawn, that is, how
much the prevalence varies from bag to bag. This is the knob that turns
:func:`~mlquantify.datasets.make_quantification` into the different evaluation
protocols. The figure below shows the distribution of the positive-class
prevalence over many bags for each strategy.

.. plot::

    import numpy as np
    import matplotlib.pyplot as plt

    from mlquantify.datasets import make_quantification

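    def positive_prevalences(**kwargs):
        """Generate 400 bags and return each bag's positive-class prevalence."""
        _, _, prevs = make_quantification(
            n_batches=400, batch_size=200, n_features=2, n_redundant=0,
            random_state=0, **kwargs
        )
        # Column 1 of the per-bag prevalence array is the positive class.
        return prevs[:, 1]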

    panels = [
        ("uniform — full range of shifts",
         dict(prevalence="uniform")),
        ("grid — even sweep (APP)",
         dict(prevalence="grid", n_prevalences=21)),
        ("natural — around weights=[0.6, 0.4]",
         dict(prevalence="natural", weights=[0.6, 0.4])),
        ("dirichlet near 0.7, concentration=200 (tight)",
         dict(prevalence="dirichlet", target_prevalence=[0.3, 0.7], concentration=200)),
        ("dirichlet near 0.7, concentration=8 (loose)",
         dict(prevalence="dirichlet", target_prevalence=[0.3, 0.7], concentration=8)),
    ]
    colors = ["#264653", "#2a9d8f", "#e9c46a", "#e76f51", "#9b5de5"]

    fig, axes = plt.subplots(len(panels), 1, figsize=(7, 8.5), sharex=True)
    bins = np.linspace(0, 1, 31)
    for ax, (title, kwargs), color in zip(axes, panels, colors):
        ax.hist(positive_prevalences(**kwargs), bins=bins, color=color, alpha=0.85)
        ax.set_title(title, fontsize="small")
        ax.set_yticks([])
    axes[-1].set_xlabel("Positive-class prevalence of the bag")
    fig.suptitle("How `prevalence` and `concentration` shape bag-to-bag variability")
    fig.tight_layout()

Reading top to bottom: ``"uniform"`` spreads bags across the whole range (maximum
shift), ``"grid"`` places them on regularly spaced points, ``"natural"`` clusters
them tightly around the population prior, and the two Dirichlet panels share the
same **target** of 0.70 but have very different spreads: a high ``concentration``
pins bags to the target, while a low ``concentration`` lets them wander. That
single parameter is how you go from "test everywhere" to "test near a realistic
operating point".
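
To build intuition for the Dirichlet panels without generating any data, the
spread can be reproduced with NumPy alone. The sketch below *assumes* that the
Dirichlet parameters are ``concentration * target_prevalence``; this is a common
parameterization, not necessarily the one
:func:`~mlquantify.datasets.make_quantification` uses internally. The helper
``sample_positive_prevalence`` is hypothetical and exists only for illustration.

.. code-block:: python

    import numpy as np

    # Assumption: alpha = concentration * target. This is a common way to
    # parameterize a Dirichlet by a mean and a concentration; it may differ
    # from what mlquantify does internally.
    rng = np.random.default_rng(0)
    target = np.array([0.3, 0.7])

    def sample_positive_prevalence(concentration, n_bags=400):
        """Draw per-bag prevalences and return the positive-class column."""
        alpha = concentration * target
        return rng.dirichlet(alpha, size=n_bags)[:, 1]

    tight = sample_positive_prevalence(200)
    loose = sample_positive_prevalence(8)

    # Both samples are centred on the target; only the spread differs.
    print(f"concentration=200: mean={tight.mean():.2f}, std={tight.std():.3f}")
    print(f"concentration=8:   mean={loose.mean():.2f}, std={loose.std():.3f}")

Under this assumed parameterization the standard deviation of the positive-class
prevalence is :math:`\sqrt{p(1-p)/(c+1)}` for target :math:`p` and concentration
:math:`c`, which gives roughly 0.03 for ``concentration=200`` and 0.15 for
``concentration=8``.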

.. seealso::

   - :func:`~mlquantify.datasets.make_quantification` — the full parameter list.
   - :ref:`sphx_protocols` — the same idea expressed as evaluation protocols.