.. _sphx_synthetic_difficulty:

==================================
Class separability and label noise
==================================

How hard is the quantification problem? Two ``make_quantification`` knobs
control that directly: ``class_sep`` (how far apart the class clusters sit) and
``flip_y`` (how much label noise is injected). They set the ceiling on how well
*any* method can do, so they are worth understanding before comparing
quantifiers.

First, what ``class_sep`` does to the feature space:

.. plot::

    import matplotlib.pyplot as plt

    from mlquantify.datasets import make_quantification

    fig, axes = plt.subplots(1, 3, figsize=(13, 4), sharex=True, sharey=True)
    for ax, sep in zip(axes, [0.4, 1.0, 2.5]):
        Xs, ys, _ = make_quantification(
            n_batches=1, batch_size=700, n_features=2, n_redundant=0,
            prevalence=[[0.5, 0.5]], class_sep=sep, random_state=0,
        )
        X, y = Xs[0], ys[0]
        for k, color in enumerate(["#2a9d8f", "#e76f51"]):
            mask = y == k
            ax.scatter(X[mask, 0], X[mask, 1], s=10, alpha=0.6, color=color)
        ax.set_title(f"class_sep = {sep}")
        ax.set_xticks([])
        ax.set_yticks([])
    fig.suptitle("Low separability (left) is a hard quantification problem", y=1.02)
    fig.tight_layout()

The same picture for **three classes**. Each cluster keeps its own colour, and
the only changes to the call are ``n_classes=3`` and a three-entry
``prevalence``:

.. plot::

    import matplotlib.pyplot as plt

    from mlquantify.datasets import make_quantification

    fig, axes = plt.subplots(1, 3, figsize=(13, 4), sharex=True, sharey=True)
    for ax, sep in zip(axes, [0.4, 1.0, 2.5]):
        Xs, ys, _ = make_quantification(
            n_batches=1, batch_size=900, n_classes=3, n_features=2,
            n_redundant=0, prevalence=[[1 / 3, 1 / 3, 1 / 3]],
            class_sep=sep, random_state=3,
        )
        X, y = Xs[0], ys[0]
        for k, color in enumerate(["#2a9d8f", "#e76f51", "#4361ee"]):
            mask = y == k
            ax.scatter(X[mask, 0], X[mask, 1], s=10, alpha=0.6,
                       color=color, label=f"class {k}")
        ax.set_title(f"class_sep = {sep}")
        ax.set_xticks([])
        ax.set_yticks([])
    axes[0].legend(loc="best", fontsize="small")
    fig.suptitle("Three classes: the clusters separate as class_sep grows", y=1.02)
    fig.tight_layout()

.. note::

   ``make_classification`` (which ``make_quantification`` builds on) draws each
   class with its *own* random covariance, so some clusters come out rounder and
   others more elongated. With only two features that shape is fully visible; in
   a higher-dimensional run it is spread across the extra dimensions. The bags
   here also use a balanced prevalence, because a single ``prevalence="uniform"``
   draw can be strongly imbalanced and leave one class with very few points.
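
To put a number on that risk, the sketch below assumes that
``prevalence="uniform"`` samples the prevalence vector uniformly from the
probability simplex (an assumption about the sampler, not something this page
confirms). It estimates how often the rarest class falls below 10% of a bag and
compares the estimate with the closed form. The helper names are illustrative
and are not part of ``mlquantify``.

.. code-block:: python

    import random


    def draw_uniform_prevalence(n_classes, rng):
        """Draw one prevalence vector uniformly from the simplex.

        Normalised Exp(1) draws follow a flat Dirichlet distribution.
        """
        draws = [rng.expovariate(1.0) for _ in range(n_classes)]
        total = sum(draws)
        return [d / total for d in draws]


    def rare_class_rate(n_classes, threshold=0.1, n_draws=20_000, seed=0):
        """Fraction of draws whose smallest class prevalence is below threshold."""
        rng = random.Random(seed)
        hits = sum(
            min(draw_uniform_prevalence(n_classes, rng)) < threshold
            for _ in range(n_draws)
        )
        return hits / n_draws


    def rare_class_probability(n_classes, threshold=0.1):
        """Closed form for the same quantity, valid while K * threshold <= 1."""
        return 1.0 - (1.0 - n_classes * threshold) ** (n_classes - 1)


    # Monte Carlo estimate next to the exact value, for two and three classes.
    for k in (2, 3):
        print(k, round(rare_class_rate(k), 3), round(rare_class_probability(k), 3))

Under that assumption, about one draw in five leaves the rarest class under 10%
with two classes, and about one in two does so with three classes, which is why
the scatter plots fix the prevalence instead.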

Now the consequence: sweep ``class_sep`` and measure the mean absolute error of
three quantifiers (CC, ACC and EMQ) over 120 bags whose prevalences are drawn
with ``prevalence="uniform"``. As the classes separate, the underlying
classifier sharpens and every method improves, but the methods that correct the
classifier's raw output (ACC and EMQ) stay ahead when the problem is hard.

.. plot::

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression

    from mlquantify import set_config
    from mlquantify.datasets import make_quantification
    from mlquantify.counting import CC, ACC
    from mlquantify.likelihood import EMQ

    set_config(prevalence_return_type="array")   # predictions come back as arrays

    seps = [0.4, 0.7, 1.0, 1.5, 2.5]
    methods = {"CC": CC, "ACC": ACC, "EMQ": EMQ}
    curves = {name: [] for name in methods}

    for sep in seps:
        Xtr, ytr, Xs, ys, prevs = make_quantification(
            n_batches=120, batch_size=200, return_train=True,
            train_prevalence=[0.5, 0.5], prevalence="uniform",
            n_features=2, n_redundant=0, class_sep=sep, random_state=0,
        )
        for name, Method in methods.items():
            q = Method(LogisticRegression(max_iter=1000)).fit(Xtr, ytr)
            pred = np.vstack([q.predict(Xb) for Xb in Xs])
            curves[name].append(float(np.mean(np.abs(pred - prevs))))

    fig, ax = plt.subplots(figsize=(7, 4.5))
    for (name, vals), color in zip(curves.items(), ["#e76f51", "#2a9d8f", "#264653"]):
        ax.plot(seps, vals, "o-", color=color, label=name)
    ax.set_xlabel("class_sep  (higher = easier)")
    ax.set_ylabel("Mean absolute error over uniform bags")
    ax.set_title("Quantification error vs. class separability")
    ax.legend()
    fig.tight_layout()
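
One way to read the gap between CC and ACC is through the classifier's error
rates. The sketch below is a toy model, not the experiment above: it assumes
two one-dimensional, unit-variance Gaussian classes centred at ``-sep`` and
``+sep`` with the decision threshold at zero, so the true and false positive
rates have closed forms. The function names are illustrative and do not belong
to ``mlquantify``.

.. code-block:: python

    from math import erf, sqrt


    def normal_cdf(z):
        """Standard normal cumulative distribution function."""
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))


    def error_rates(sep):
        """TPR and FPR of the threshold-at-zero rule for classes at -sep and +sep."""
        tpr = normal_cdf(sep)    # positives (mean +sep) landing above zero
        fpr = normal_cdf(-sep)   # negatives (mean -sep) landing above zero
        return tpr, fpr


    def expected_cc(p, sep):
        """Expected classify-and-count estimate for true positive prevalence p."""
        tpr, fpr = error_rates(sep)
        return p * tpr + (1.0 - p) * fpr


    def adjusted_count(cc, sep):
        """ACC-style correction: invert the relation used in expected_cc."""
        tpr, fpr = error_rates(sep)
        return (cc - fpr) / (tpr - fpr)


    true_p = 0.9
    for sep in (0.4, 1.0, 2.5):
        cc = expected_cc(true_p, sep)
        print(f"sep={sep}: CC bias={abs(cc - true_p):.3f}, "
              f"ACC bias={abs(adjusted_count(cc, sep) - true_p):.3f}")

In this idealised model the CC bias shrinks as ``sep`` grows, while the adjusted
estimate recovers the true prevalence because the rates are known exactly. In
practice the rates are estimated from training data, so ACC carries estimation
noise that the sketch ignores.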

This is the controlled setting that makes ``make_quantification`` useful: you can
hold everything fixed and vary one source of difficulty at a time — separability
here, label noise via ``flip_y``, or sample size via ``batch_size``.
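
For the ``flip_y`` knob, a rough sketch of what label noise does to the
prevalence that a bag's labels report. It assumes ``flip_y`` behaves as in
scikit-learn's ``make_classification``, where it is the fraction of samples
whose class is reassigned at random, and it treats the noise as applied to the
labels of an already drawn bag. Whether ``make_quantification`` forwards the
parameter in exactly this way is an assumption here, and the function names are
illustrative.

.. code-block:: python

    import random


    def noisy_prevalence(prevalence, flip_y):
        """Expected label prevalence after a fraction flip_y is reassigned uniformly."""
        n_classes = len(prevalence)
        return [(1.0 - flip_y) * p + flip_y / n_classes for p in prevalence]


    def simulate_noisy_prevalence(prevalence, flip_y, n_samples=50_000, seed=0):
        """Monte Carlo check of noisy_prevalence on synthetic labels."""
        rng = random.Random(seed)
        classes = list(range(len(prevalence)))
        counts = [0] * len(classes)
        for _ in range(n_samples):
            label = rng.choices(classes, weights=prevalence)[0]
            if rng.random() < flip_y:          # this sample is selected for noise
                label = rng.choice(classes)    # the new label may equal the old one
            counts[label] += 1
        return [c / n_samples for c in counts]


    clean = [0.9, 0.1]
    print(noisy_prevalence(clean, flip_y=0.2))
    print(simulate_noisy_prevalence(clean, flip_y=0.2))

Under these assumptions a 90/10 bag with ``flip_y=0.2`` reports labels at
roughly 82/18: the noise pulls the observed prevalence toward uniform, and the
pull is strongest for the most imbalanced bags.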

.. seealso::

   - :ref:`sphx_synthetic_quantifiers` — the per-bag diagonal view of the same
     methods.