Loading real-world datasets
mlquantify.datasets ships 25 fetchers that download and cache well-known
quantification benchmarks: tabular, text, image, and graph datasets, a
concept-drift stream, and the LeQua 2024 competition data. They follow the
sklearn.datasets conventions: every loader is a fetch_<name>(...) function that
takes keyword-only arguments, returns a Bunch, and caches the raw file so that
it is downloaded only once.
Basic load
from mlquantify.datasets import fetch_mushroom
data = fetch_mushroom() # downloads + caches on first call
data.data # feature matrix, shape (n_samples, n_features)
data.target # integer class labels
data.target_names # e.g. ['edible', 'poisonous']
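Since quantification is about class prevalences, a natural first check after loading is how the labels are distributed. The sketch below assumes only that the target is a sequence of integer labels that index into target_names, as the comments above suggest. class_prevalence is a hypothetical helper, not part of mlquantify, and the toy labels stand in for data.target so the example runs without a download.

```python
from collections import Counter


def class_prevalence(target, target_names):
    """Return {class name: fraction of samples} for integer class labels.

    Hypothetical helper for illustration; not an mlquantify function.
    """
    labels = [int(t) for t in target]  # also accepts NumPy integer arrays
    if not labels:
        raise ValueError("target is empty")
    counts = Counter(labels)
    total = len(labels)
    # Classes that never occur get prevalence 0.0 rather than being dropped.
    return {name: counts.get(i, 0) / total for i, name in enumerate(target_names)}


# Toy stand-in for data.target / data.target_names.
toy_target = [0, 1, 1, 0, 1, 1, 1, 0]
toy_names = ["edible", "poisonous"]
prevalence = class_prevalence(toy_target, toy_names)
print(prevalence)
```

With a real dataset the call would be class_prevalence(data.target, data.target_names).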
scikit-learn-style outputs
# A plain (X, y) tuple instead of a Bunch:
X, y = fetch_mushroom(return_X_y=True)
# pandas DataFrame / Series, with a combined .frame:
bunch = fetch_mushroom(as_frame=True)
bunch.frame.head()
Caching and offline use
Every fetcher accepts the same caching controls:
# Where the raw files are cached (default: a _data/ dir next to the package).
data = fetch_mushroom(data_home="~/mlquantify_data")
# Fail instead of downloading when the cache is empty (e.g. offline / CI):
data = fetch_mushroom(download_if_missing=False)
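In CI it is sometimes preferable to skip a benchmark rather than fail when the cache is cold. The following is a minimal sketch of that pattern. It assumes a fetcher called with download_if_missing=False raises OSError on a cache miss, which is the scikit-learn convention; this guide does not say which exception mlquantify raises, so the except clause may need adjusting. load_cached_or_none and fake_fetcher are hypothetical names used only for illustration.

```python
def load_cached_or_none(fetcher, **kwargs):
    """Call a fetcher in offline mode; return None if the cache is empty.

    Assumption: a cache miss surfaces as OSError (the scikit-learn
    convention). Adjust the exception type if the library differs.
    """
    try:
        return fetcher(download_if_missing=False, **kwargs)
    except OSError:
        return None


# Stand-in fetcher that simulates an empty cache, so the sketch runs offline.
def fake_fetcher(*, data_home=None, download_if_missing=True):
    if not download_if_missing:
        raise OSError("dataset not found in cache and downloads are disabled")
    return {"data": [[0.0]], "target": [0]}


missing = load_cached_or_none(fake_fetcher)
available = fake_fetcher()  # with downloads allowed, the stand-in returns data
print(missing is None)
```

With the real library, the equivalent call would be load_cached_or_none(fetch_mushroom, data_home="~/mlquantify_data"), after which a test can be skipped whenever the result is None.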
Other datasets follow the same pattern, e.g. fetch_dry_bean (multiclass
tabular), fetch_imdb (text) or fetch_cifar10 (image). See the Real-World
Datasets guide for the full catalogue and each dataset’s extra options.
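Because the fetchers share one calling convention, they can be treated as interchangeable callables. The sketch below illustrates this with a hypothetical summarize helper that relies only on the return_X_y=True option shown earlier. The two toy fetchers are stand-ins so the example runs without downloads; with mlquantify installed, the list would hold real fetchers such as fetch_mushroom and fetch_dry_bean.

```python
def summarize(fetchers, **kwargs):
    """Return {fetcher name: (n_samples, n_classes)} for each fetcher.

    Hypothetical helper; it relies only on the shared return_X_y=True option.
    """
    summary = {}
    for fetch in fetchers:
        X, y = fetch(return_X_y=True, **kwargs)
        summary[fetch.__name__] = (len(X), len(set(y)))
    return summary


# Stand-in fetchers mimicking the keyword-only convention.
def fetch_toy_binary(*, return_X_y=False):
    return [[0.1], [0.2], [0.3]], [0, 1, 0]


def fetch_toy_multiclass(*, return_X_y=False):
    return [[1.0], [2.0], [3.0], [4.0]], [0, 1, 2, 1]


summary = summarize([fetch_toy_binary, fetch_toy_multiclass])
for name, (n_samples, n_classes) in summary.items():
    print(f"{name}: {n_samples} samples, {n_classes} classes")
```

Extra keyword arguments such as data_home are forwarded to every fetcher, so one cache location can be shared across the whole loop.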
See also
Real-World Datasets — the full list of fetchers and their options.
Evaluating a quantifier on real data — score a quantifier on a fetched dataset.