7. Real-World Datasets
Alongside the synthetic generator (Synthetic Datasets), mlquantify.datasets
ships 25 dataset fetchers that download and cache well-known quantification
benchmarks. They follow the same conventions as sklearn.datasets: every
loader is a keyword-only fetch_<name>(...) function that returns a
Bunch (or, with return_X_y=True, a plain (X, y) tuple) and caches the raw
file under a local data_home so that it is downloaded only once.
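The download-once behaviour can be pictured with a small sketch. It is not mlquantify's implementation: the helper fetch_cached, its parameters and the example URL are hypothetical, and a fake downloader stands in for the network so the snippet runs offline.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def fetch_cached(url, data_home, filename, download=urlretrieve):
    """Return the local path for `url`, downloading only if the file is absent.

    Hypothetical helper for illustration; not part of mlquantify.
    """
    data_home = Path(data_home).expanduser()
    data_home.mkdir(parents=True, exist_ok=True)
    target = data_home / filename
    if not target.exists():
        # First call: fetch the raw file into the cache directory.
        download(url, target)
    return target


# Fake downloader so the demonstration never touches the network.
calls = []


def fake_download(url, target):
    calls.append(url)
    Path(target).write_text("a,b\n1,2\n")


with tempfile.TemporaryDirectory() as tmp:
    url = "https://example.invalid/toy.csv"  # placeholder URL
    first = fetch_cached(url, tmp, "toy.csv", fake_download)
    second = fetch_cached(url, tmp, "toy.csv", fake_download)
    n_downloads = len(calls)   # the second call is served from the cache
    same_path = first == second
```

Keying the cache on a file name inside data_home is the simplest possible policy; a real fetcher may also verify checksums or handle archives.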
What sets them apart from the scikit-learn loaders is the optional
quantification protocol: pass protocol="app" (or another protocol) and the
returned Bunch additionally carries .samples and .prevalences —
a collection of test bags with known class prevalence, ready to score a
quantifier against.
from mlquantify.datasets import fetch_mushroom
b = fetch_mushroom() # plain load -> Bunch(data, target, ...)
X, y = fetch_mushroom(return_X_y=True) # sklearn-style tuple
# quantification mode: 1000 bags of 500 instances drawn with the
# Artificial Prevalence Protocol
b = fetch_mushroom(protocol="app", n_samples=1000, sample_size=500, random_state=0)
b.samples # list of index arrays into b.data, one per bag
b.prevalences # (n_samples, n_classes) array of TRUE prevalences
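To make the role of .samples and .prevalences concrete, here is a minimal, self-contained sketch that uses only NumPy. Everything in it is an assumption for illustration: app_bags is a hypothetical helper implementing one common variant of the Artificial Prevalence Protocol (prevalence vectors drawn uniformly from the simplex), the data are synthetic stand-ins for a fetched dataset, and classify_and_count is a deliberately naive quantifier. It does not reproduce how mlquantify draws its bags.

```python
import numpy as np


def app_bags(y, n_samples, sample_size, rng):
    """Draw bags whose class prevalence is sampled uniformly from the simplex.

    Hypothetical helper; labels are assumed to be the integers 0..K-1.
    Returns a list of index arrays and an (n_samples, n_classes) array
    holding the true prevalence of each bag.
    """
    classes = np.unique(y)
    pools = [np.flatnonzero(y == c) for c in classes]
    samples, prevalences = [], []
    for _ in range(n_samples):
        target = rng.dirichlet(np.ones(len(classes)))   # uniform on the simplex
        counts = rng.multinomial(sample_size, target)   # instances per class
        idx = np.concatenate([
            rng.choice(pool, size=k, replace=True)      # sample with replacement
            for pool, k in zip(pools, counts)
        ])
        samples.append(idx)
        prevalences.append(counts / sample_size)        # exact prevalence of the bag
    return samples, np.array(prevalences)


# Synthetic stand-in for b.data / b.target: one noisy feature, two classes.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)
X = y[:, None] + rng.normal(scale=0.8, size=(2000, 1))

samples, prevalences = app_bags(y, n_samples=50, sample_size=200, rng=rng)


def classify_and_count(X_bag):
    """Naive quantifier: threshold the feature and count predicted labels."""
    predicted = (X_bag[:, 0] > 0.5).astype(int)
    return np.bincount(predicted, minlength=2) / len(predicted)


# Score the quantifier: mean absolute error over all bags and classes.
estimates = np.array([classify_and_count(X[idx]) for idx in samples])
mae = float(np.abs(estimates - prevalences).mean())
```

With a fetched Bunch, the same scoring loop would iterate over b.samples and compare each estimate with the matching row of b.prevalences.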
7.2. Per-dataset configuration
Each dataset adds a few of its own keyword arguments on top of the shared ones
(shown in the Fetcher (key params) column). target_col= lets you override
the column used as the label for the tabular CSV loaders.
7.2.1. Binary (tabular)

| Dataset | Fetcher (key params) | Task / classes | Source |
|---|---|---|---|
| Mushroom | | Binary (2) · easy | UCI #73 |
| Banknote Authentication | | Binary (2) · easy | UCI #267 |
| Haberman’s Survival | | Binary (2) · hard | UCI #43 |
| Pima Indians Diabetes | | Binary (2) · hard | jbrownlee mirror |
| MiniBooNE Particle ID | | Binary (2) · med-hard | UCI #199 |

7.2.2. Multiclass (tabular)

| Dataset | Fetcher (key params) | Task / classes | Source |
|---|---|---|---|
| Optical / Pen-Based Digits | | Multiclass (10) · easy | UCI #80 / #81 |
| Dry Bean | | Multiclass (7) · easy-med | UCI #602 |
| Covertype | | Multiclass (7) · hard | UCI #31 |
| Yeast | | Multiclass (10) · hard | UCI #110 |
| Sensorless Drive Diagnosis | | Multiclass (11) · med-hard | UCI #325 |
| Statlog (Shuttle) | | Multiclass (7) · hard (imbalanced) | UCI #148 |

7.2.3. Ordinal

| Dataset | Fetcher (key params) | Task / classes | Source |
|---|---|---|---|
| Wine Quality | | Ordinal (quality 3–9) | UCI #186 |
| LeQua 2024 | | T1 binary · T2 multiclass-28 · T3 ordinal · T4 binary (covariate) | Zenodo |

7.2.4. Text

| Dataset | Fetcher (key params) | Task / classes | Source |
|---|---|---|---|
| RCV1-v2 | | Multilabel topic (sparse) | via scikit-learn |
| 20 Newsgroups | | Multiclass (20) | figshare |
| IMDB | | Binary (2) | Stanford |
| Multi-Domain Sentiment | | Binary (2) · covariate / domain shift | JHU (Blitzer) |

7.2.5. Temporal / time-shift

| Dataset | Fetcher (key params) | Task / classes | Source |
|---|---|---|---|
| Sentiment140 | | Binary (2) · temporal | Stanford |
| Electricity (Elec2) | | Binary (2) · concept drift | scikit-multiflow |
| Airlines (flight delay) | | Binary (2) · concept drift | scikit-multiflow |
| Online News Popularity | | Binary (popularity thresholded at | UCI #332 |

7.2.6. Image, graph & synthetic (other shift)

| Dataset | Fetcher (key params) | Task / classes | Source |
|---|---|---|---|
| SEA Concepts | | Binary (2) · concept drift (generated, no download) | synthetic |
| MNIST → USPS | | Multiclass (10) · covariate shift | LIBSVM |
| CIFAR-10 | | Multiclass (10) · label / prior shift | Toronto |
| Planetoid (Cora / CiteSeer / PubMed) | | Multiclass (3–7) · graph, covariate / structural shift | GitHub (Planetoid) |

7.3. Optional download progress
The fetchers print nothing by default and pull in no progress dependency. To
display download progress, register any callable with set_progress_hook;
every fetcher will then report its downloads to it as hook(downloaded, total, url)
(total is None when the server sends no Content-Length). This keeps the
choice of progress tool — plain print, logging or tqdm — entirely yours:
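from mlquantify.datasets import set_progress_hook
from tqdm import tqdm  # your dependency, not mlquantify's

bars = {}  # one progress bar per URL

def show(downloaded, total, url):
    # Create the bar lazily; dict.setdefault would build a new tqdm on every call.
    if url not in bars:
        bars[url] = tqdm(total=total, unit="B", unit_scale=True)
    bar = bars[url]
    # Assumes `downloaded` is the cumulative byte count, so advance by the difference.
    bar.update(downloaded - bar.n)

set_progress_hook(show)  # all fetchers now report progress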
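For a setup without tqdm, a hook built on the standard logging module might look like the sketch below. Only the hook(downloaded, total, url) call shape comes from the text above; the names format_progress and log_progress and the logger name are hypothetical, and downloaded is assumed to be a cumulative byte count.

```python
import logging

logger = logging.getLogger("dataset_downloads")  # hypothetical logger name


def format_progress(downloaded, total, url):
    """Build a one-line progress message; `total` may be None."""
    if total:
        percent = 100.0 * downloaded / total
        return f"{url}: {downloaded}/{total} bytes ({percent:.1f}%)"
    # No Content-Length was sent, so only the byte count is known.
    return f"{url}: {downloaded} bytes (total size unknown)"


def log_progress(downloaded, total, url):
    """Progress hook that forwards each report to the logging system."""
    logger.info(format_progress(downloaded, total, url))


# Registration, as described in this section:
#   from mlquantify.datasets import set_progress_hook
#   set_progress_hook(log_progress)
```

Keeping the formatting in its own function makes the hook easy to test and lets the same message feed print, logging or any other sink.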
7.4. References
The table below lists, for each dataset, the quantification works in the literature that use it (numbers refer to the reference list underneath).
| Dataset | Used by (quantification papers) |
|---|---|
| Mushroom | |
| Banknote Authentication | none (new addition) |
| Haberman’s Survival | |
| Pima Indians Diabetes | |
| MiniBooNE Particle ID | |
| Optical / Pen-Based Digits | |
| Dry Bean | none (new addition) |
| Covertype | |
| Yeast | |
| Sensorless Drive Diagnosis | none (new addition) |
| Statlog (Shuttle) | |
| Wine Quality | |
| LeQua 2024 (T1–T4) | |
| RCV1-v2 | |
| 20 Newsgroups | |
| IMDB | |
| Multi-Domain Sentiment | none (new addition) |
| Sentiment140 | |
| Electricity (Elec2) | none (new addition) |
| Airlines | none (new addition) |
| Online News Popularity | |
| SEA Concepts | none (new addition) |
| MNIST → USPS | |
| CIFAR-10 | |
| Planetoid (Cora / CiteSeer / PubMed) | |

Datasets marked new addition are recent additions to the library that are not yet used by a quantification work in the surveyed bibliography; see the Source column above for each dataset’s original reference.
See also
Synthetic Datasets: the make_quantification generator for fully controlled prior / covariate / concept shift.
mlquantify.datasets: the full API reference for every fetcher.