Loading real-world datasets#

mlquantify.datasets ships 25 fetchers that download and cache well-known quantification benchmarks, covering tabular, text, image, and graph data, a concept-drift stream, and the LeQua 2024 competition. They follow the sklearn.datasets conventions: every loader is a fetch_<name>(...) function with keyword-only arguments that returns a Bunch by default and caches the raw file so that it is downloaded only once.

Basic load#

from mlquantify.datasets import fetch_mushroom

data = fetch_mushroom()      # downloads + caches on first call
data.data                    # feature matrix, shape (n_samples, n_features)
data.target                  # integer class labels
data.target_names            # e.g. ['edible', 'poisonous']
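
Because quantification is about class prevalence, a natural first check after loading is the label distribution. The following is a minimal sketch, not part of mlquantify: class_prevalence is a hypothetical helper, and it assumes that target_names is indexed by the integer label (the sklearn convention). It runs on stand-in labels so that nothing is downloaded.

```python
from collections import Counter


def class_prevalence(target, target_names=None):
    """Return {class: fraction of samples} for a sequence of integer labels."""
    labels = [int(label) for label in target]
    if not labels:
        raise ValueError("target is empty")
    counts = Counter(labels)
    total = len(labels)
    prevalence = {}
    for label in sorted(counts):
        # ASSUMPTION: target_names[i] is the name of integer class i.
        name = target_names[label] if target_names is not None else label
        prevalence[name] = counts[label] / total
    return prevalence


# Stand-in labels so the sketch runs offline; with the real loader,
# pass data.target and data.target_names instead.
toy_target = [0, 1, 1, 0, 1, 1, 1, 0]
toy_names = ["edible", "poisonous"]
prevalence = class_prevalence(toy_target, toy_names)
print(prevalence)  # {'edible': 0.375, 'poisonous': 0.625}
```

With a loaded dataset, the equivalent call would be class_prevalence(data.target, data.target_names).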

scikit-learn-style outputs#

# A plain (X, y) tuple instead of a Bunch:
X, y = fetch_mushroom(return_X_y=True)

# pandas DataFrame / Series, with a combined .frame:
bunch = fetch_mushroom(as_frame=True)
bunch.frame.head()

Caching and offline use#

Every fetcher accepts the same caching controls:

# Where the raw files are cached (default: a _data/ dir next to the package).
data = fetch_mushroom(data_home="~/mlquantify_data")

# Fail instead of downloading when the cache is empty (e.g. offline / CI):
data = fetch_mushroom(download_if_missing=False)
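
The two controls can be combined into a cache-first pattern: try the cache, and download only when the cached copy is missing. The sketch below is an illustration under stated assumptions rather than library code. load_offline_first is a hypothetical helper, and it assumes that a missing cache raises OSError, as the sklearn.datasets fetchers do; check what mlquantify actually raises before relying on it. A fake fetcher stands in for fetch_mushroom so the example runs without network access.

```python
import os


def load_offline_first(fetcher, data_home="~/mlquantify_data", **kwargs):
    """Try the cache first; download only if the cached copy is missing."""
    # Expand "~" here instead of relying on the fetcher to do it.
    home = os.path.expanduser(data_home)
    try:
        return fetcher(data_home=home, download_if_missing=False, **kwargs)
    except OSError:
        # ASSUMPTION: a missing cache raises OSError (sklearn convention).
        return fetcher(data_home=home, download_if_missing=True, **kwargs)


# Fake fetcher (hypothetical) that mimics an empty cache.
calls = []


def fake_fetcher(*, data_home, download_if_missing=True):
    calls.append(download_if_missing)
    if not download_if_missing:
        raise OSError("dataset not found in cache")
    return {"data_home": data_home}


result = load_offline_first(fake_fetcher)
print(calls)  # [False, True]
```

In real use the call would be load_offline_first(fetch_mushroom). In CI, where downloads should never happen, calling the fetcher directly with download_if_missing=False and letting the error surface is the safer choice.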

Other datasets follow the same pattern, for example fetch_dry_bean (multiclass tabular), fetch_imdb (text), or fetch_cifar10 (image). See the Real-World Datasets guide for the full catalogue and each dataset’s extra options.

See also