.. _getting_started:

Getting Started
===============

**mlquantify** is a comprehensive Python toolkit for **Quantification** (also known as *Class Prevalence Estimation*, *Class Prior Estimation*, or *Shift Estimation*).

Installation
------------

You can install ``mlquantify`` using pip:

.. code-block:: bash

    pip install mlquantify

Or install the latest development version from source:

.. code-block:: bash

    git clone https://github.com/luizfernandolj/mlquantify.git
    cd mlquantify
    pip install .

Basic Usage
-----------

Most quantifiers in ``mlquantify`` behave like scikit-learn estimators. They implement ``fit(X, y)`` and ``predict(X)`` methods.

.. code-block:: python

    from sklearn.linear_model import LogisticRegression
    from mlquantify.counting import CC  # Classify & Count

    # 1. Initialize a base classifier
    estimator = LogisticRegression()

    # 2. Wrap it with an aggregative quantifier (e.g., CC)
    quantifier = CC(estimator)

    # 3. Fit on labeled training data
    quantifier.fit(X_train, y_train)

    # 4. Predict class prevalences on new data
    prevalences = quantifier.predict(X_test)
    print(prevalences)

The ``fit`` Parameters
----------------------

All aggregative quantifiers in ``mlquantify`` support a consistent set of parameters in their ``fit`` method to control how the underlying classifier is trained or used:

* ``X``: The training input samples (array-like, sparse matrix).
* ``y``: The target values (class labels).
* ``estimator_fitted`` (bool): If ``True``, assumes the provided estimator is already trained. If ``False`` (default), trains the estimator on the provided ``X`` and ``y``.
* ``cv`` (int, cross-validation generator, or iterable): Determines the cross-validation splitting strategy for generating internal predictions (used by methods like ACC, PACC).
* ``stratified`` (bool): If ``True``, uses stratified folds for cross-validation.
* ``shuffle`` (bool): Whether to shuffle the data before splitting in cross-validation.

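
To make the ``cv``, ``stratified``, and ``shuffle`` options more concrete, the snippet below is a minimal sketch that uses only scikit-learn. It is not ``mlquantify``'s internal implementation: the five-fold split, the variable names, and the TPR/FPR computation are assumptions chosen for illustration. It shows how cross-validation yields a prediction for every training instance from a model that never saw that instance, which is the kind of held-out prediction that adjusted methods such as ACC rely on.

.. code-block:: python

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_predict

    # Synthetic, imbalanced binary data (illustrative only)
    X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

    # Roughly what cv=5, stratified=True, shuffle=True describe (assumed mapping)
    splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

    # Each instance is predicted by a model trained on the other folds
    held_out_pred = cross_val_predict(
        LogisticRegression(max_iter=1000), X, y, cv=splitter
    )

    # Classifier rates estimated from held-out predictions only
    tpr = np.mean(held_out_pred[y == 1] == 1)
    fpr = np.mean(held_out_pred[y == 0] == 1)
    print(f"held-out TPR={tpr:.3f}, FPR={fpr:.3f}")

Estimating these rates from held-out predictions rather than from predictions on the data the classifier was trained on avoids overly optimistic values.
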
Aggregative Quantifiers & ``aggregate``
---------------------------------------

Aggregative methods (like CC, ACC, PCC) estimate prevalence by aggregating predictions from individual instances.

Unlike standard estimators, they offer an additional ``aggregate`` method. This allows you to perform quantification **without re-predicting** if you already have the classifier's outputs (labels or probabilities) for your test set.

.. code-block:: python

    # Assume `classifier` is the trained base classifier and we already
    # have its predictions for the test set
    predictions = classifier.predict(X_test)

    # Use 'aggregate' directly - no need for X_test
    estimated_prevalence = quantifier.aggregate(predictions)

Model evaluation
----------------

Fitting a model to some data does not entail that it will predict well on unseen data. This needs to be directly evaluated. We typically use ``train_test_split`` to split a dataset into train and test sets, and then use specific metrics to compare the predicted prevalences against the true prevalences.

``mlquantify`` provides many tools for model evaluation in the :mod:`mlquantify.metrics` module.

.. code-block:: python

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from mlquantify.counting import CC
    from mlquantify.metrics import MAE

    # Generate synthetic data
    X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

    # Initialize and fit
    quantifier = CC(LogisticRegression())
    quantifier.fit(X_train, y_train)

    # Predict prevalences
    y_pred = quantifier.predict(X_test)

    # Calculate Mean Absolute Error (MAE)
    # y_test contains true labels; we convert them to prevalences for comparison
    error = MAE(y_test, y_pred)
    print(f"Mean Absolute Error: {error:.4f}")

Quantification Protocols
~~~~~~~~~~~~~~~~~~~~~~~~

In quantification, a single test set is often insufficient because we want to evaluate performance across *different* class distributions (shifts). **Protocols** like the **Artificial Prevalence Protocol (APP)** allow you to generate many test samples with varying prevalences from a single dataset.

.. code-block:: python

    from mlquantify.model_selection import APP
    from mlquantify.utils import get_prev_from_labels

    # Reuses X_test, y_test, quantifier, and MAE from the previous example

    # Create an APP generator:
    # - n_prevalences=21: Generate samples with prevalences from 0.0 to 1.0 (step 0.05)
    # - repeats=10: Generate 10 different samples for each prevalence
    protocol = APP(batch_size=100, n_prevalences=21, repeats=10, random_state=42)

    errors = []

    # APP.split() yields indices for each test sample
    for test_index in protocol.split(X_test, y_test):
        X_sample, y_sample = X_test[test_index], y_test[test_index]

        # Predict prevalence on this specific sample
        pred_prev = quantifier.predict(X_sample)

        # Calculate error for this sample
        errors.append(MAE(y_sample, pred_prev))

    print(f"Mean Absolute Error across {len(errors)} samples: {sum(errors)/len(errors):.4f}")

See :ref:`quantification_protocols` for more details on APP, NPP, and other protocols.

Next steps
----------

We have briefly covered estimator fitting and predicting, aggregative methods, and model evaluation. This guide should give you an overview of some of the main features of the library, but there is much more to ``mlquantify``!

Please refer to our :ref:`user_guide` for details on all the tools that we provide, including **Non-Aggregative Methods**, **Meta Quantification**, and **Confidence Intervals**. An exhaustive list of the public API is available in the :ref:`api_ref`, and our numerous :ref:`examples` illustrate the use of ``mlquantify`` in many different contexts.