7. Real-World Datasets

Alongside the synthetic generator (see Synthetic Datasets), mlquantify.datasets ships 25 dataset fetchers that download and cache well-known quantification benchmarks. They follow the same conventions as sklearn.datasets: every loader is a keyword-only fetch_<name>(...) function that returns a Bunch, or a plain (X, y) tuple when return_X_y=True. The raw file is cached under a local data_home, so it is downloaded only once.

What sets them apart from the scikit-learn loaders is the optional quantification protocol. Pass protocol="app" (or another protocol) and the returned Bunch additionally carries .samples and .prevalences: a collection of test bags with known class prevalence, ready to score a quantifier against.

from mlquantify.datasets import fetch_mushroom

b = fetch_mushroom()                       # plain load -> Bunch(data, target, ...)
X, y = fetch_mushroom(return_X_y=True)     # sklearn-style tuple

# quantification mode: 1000 bags of 500 instances drawn with the
# Artificial Prevalence Protocol
b = fetch_mushroom(protocol="app", n_samples=1000, sample_size=500, random_state=0)
b.samples        # list of index arrays into b.data, one per bag
b.prevalences    # (n_samples, n_classes) array of TRUE prevalences
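As a rough illustration of what "ready to score a quantifier against" means, the sketch below evaluates a naive estimator bag by bag. It runs on small stand-in arrays rather than a real download. Only the layout of `samples` (index arrays) and `prevalences` (one row per bag) is taken from the comments above. The `classify_and_count` baseline and the toy data are assumptions made for the example, and the bags here are plain random draws rather than the output of a library protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for b.data / b.target (a real fetcher would supply these).
data = rng.normal(size=(2000, 1))
target = (data[:, 0] + rng.normal(scale=0.5, size=2000) > 0).astype(int)

# Stand-ins for b.samples / b.prevalences: index arrays into `data`
# and the true class prevalence of each bag.
samples = [rng.choice(len(data), size=500, replace=False) for _ in range(20)]
prevalences = np.array(
    [np.bincount(target[idx], minlength=2) / len(idx) for idx in samples]
)

def classify_and_count(X):
    """Hypothetical baseline: threshold the single feature, then count predictions."""
    pred = (X[:, 0] > 0).astype(int)
    return np.bincount(pred, minlength=2) / len(pred)

# Score the estimator on every bag against the known prevalences.
estimates = np.array([classify_and_count(data[idx]) for idx in samples])
mae = np.abs(estimates - prevalences).mean()
print(f"mean absolute error over {len(samples)} bags: {mae:.3f}")
```

With a real fetcher, `data`, `samples` and `prevalences` would come from the returned Bunch instead of being built by hand.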

7.1. Shared configuration

Every fetcher accepts the same base keyword arguments:

| Parameter | Meaning |
| --- | --- |
| `data_home=None` | Folder used to cache the downloaded file(s). Defaults to a `_data/` directory next to the package. |
| `download_if_missing=True` | If `False`, raise instead of downloading when the cache is empty. |
| `return_X_y=False` | Return `(X, y)` instead of a `Bunch`. |
| `as_frame=False` | Return `.data` as a `pandas.DataFrame` and `.target` as a `pandas.Series` (with a combined `.frame`), where applicable. |
| `n_retries=3` | Number of download attempts before giving up. |
| `delay=1.0` | Seconds to wait between attempts. |
| `protocol=None` | Quantification sampling protocol: `None` (no bags), `"app"`, `"npp"`, `"upp"`, `"ppp"`, or an mlquantify protocol instance. When set, the `Bunch` also exposes `.samples`, `.prevalences` and `.protocol`. |
| `n_samples=1000` | Number of prevalence points (bags) the protocol generates. |
| `sample_size=500` | Instances per bag. |
| `random_state=None` | Seed forwarded to the protocol. |
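The caching parameters in the table (`data_home`, `download_if_missing`, `n_retries`, `delay`) describe a cache-then-download pattern. The following is a minimal sketch of that pattern, not mlquantify's actual code: `cached_download` and the injected `fetch_bytes` callable are hypothetical names, and the choice of `OSError` as the error type is an assumption.

```python
import tempfile
import time
from pathlib import Path

def cached_download(url, filename, data_home, fetch_bytes,
                    download_if_missing=True, n_retries=3, delay=1.0):
    """Hypothetical helper: return the cached path, downloading at most once."""
    path = Path(data_home) / filename
    if path.exists():
        return path                       # cache hit: no network access
    if not download_if_missing:
        raise OSError(f"{filename} is not cached in {data_home}")
    last_error = None
    for attempt in range(1, n_retries + 1):
        try:
            payload = fetch_bytes(url)    # e.g. urllib.request.urlopen(url).read()
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_bytes(payload)
            return path
        except OSError as exc:
            last_error = exc
            if attempt < n_retries:
                time.sleep(delay)         # wait before the next attempt
    raise OSError(f"giving up on {url} after {n_retries} attempts") from last_error

# Demo with a fake fetcher that fails once, then succeeds.
calls = []
def flaky_fetch(url):
    calls.append(url)
    if len(calls) == 1:
        raise OSError("simulated network error")
    return b"col_a,col_b\n1,2\n"

with tempfile.TemporaryDirectory() as tmp:
    url = "https://example.invalid/toy.csv"
    first = cached_download(url, "toy.csv", tmp, flaky_fetch, delay=0.0)
    second = cached_download(url, "toy.csv", tmp, flaky_fetch, delay=0.0)
    cached_text = second.read_text()
    n_calls = len(calls)                  # one failure, one success, one cache hit
```

Passing the downloader in as `fetch_bytes` keeps the sketch testable without network access.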

The protocol / n_samples / sample_size triple is the quantification-specific part of the API; see make_protocol and Synthetic Datasets for how protocols turn a labelled table into evaluation bags.
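To make the role of `n_samples` and `sample_size` concrete, here is a minimal sketch of prevalence-controlled sampling, the idea behind protocols such as APP and UPP. It is not the library's implementation: `draw_bags` is a hypothetical helper, and drawing the target prevalences from a flat Dirichlet and sampling with replacement are both assumptions made for the example.

```python
import numpy as np

def draw_bags(y, n_samples, sample_size, random_state=None):
    """Hypothetical helper: build bags whose class prevalence is drawn at random."""
    rng = np.random.default_rng(random_state)
    classes = np.unique(y)
    pools = [np.flatnonzero(y == c) for c in classes]   # row indices per class
    samples, prevalences = [], []
    for _ in range(n_samples):
        # Draw a target prevalence vector uniformly from the simplex.
        target_prev = rng.dirichlet(np.ones(len(classes)))
        # Turn it into integer per-class counts that sum to sample_size.
        counts = rng.multinomial(sample_size, target_prev)
        # Sample with replacement so that small classes cannot run out.
        idx = np.concatenate(
            [rng.choice(pool, size=k, replace=True) for pool, k in zip(pools, counts)]
        )
        rng.shuffle(idx)
        samples.append(idx)
        prevalences.append(counts / sample_size)         # prevalence actually realised
    return samples, np.asarray(prevalences)

# Toy labels with three imbalanced classes.
y = np.repeat([0, 1, 2], [600, 300, 100])
samples, prevalences = draw_bags(y, n_samples=5, sample_size=50, random_state=0)
```

Each entry of `samples` indexes into the labelled table, and the matching row of `prevalences` is the class distribution of that bag, which mirrors the `.samples` / `.prevalences` pair described above.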

7.2. Per-dataset configuration

Each dataset adds a few keyword arguments of its own on top of the shared ones; they are listed in the "Fetcher (key params)" column of the tables below. For the tabular CSV loaders, target_col= overrides the column used as the label.

7.2.1. Binary (tabular)

| Dataset | Fetcher (key params) | Task / classes | Source |
| --- | --- | --- | --- |
| Mushroom | `fetch_mushroom(target_col=None)` | Binary (2) · easy | UCI #73 |
| Banknote Authentication | `fetch_banknote_authentication(target_col=None)` | Binary (2) · easy | UCI #267 |
| Haberman’s Survival | `fetch_haberman_survival(target_col=None)` | Binary (2) · hard | UCI #43 |
| Pima Indians Diabetes | `fetch_pima_diabetes(target_col=None)` | Binary (2) · hard | jbrownlee mirror |
| MiniBooNE Particle ID | `fetch_miniboone()` | Binary (2) · med-hard | UCI #199 |

7.2.2. Multiclass (tabular)

| Dataset | Fetcher (key params) | Task / classes | Source |
| --- | --- | --- | --- |
| Optical / Pen-Based Digits | `fetch_digits_optical_penbased(which='optical'\|'penbased', target_col=None)` | Multiclass (10) · easy | UCI #80 / #81 |
| Dry Bean | `fetch_dry_bean(target_col=None)` | Multiclass (7) · easy-med | UCI #602 |
| Covertype | `fetch_covertype(target_col=None)` | Multiclass (7) · hard | UCI #31 |
| Yeast | `fetch_yeast(target_col=None)` | Multiclass (10) · hard | UCI #110 |
| Sensorless Drive Diagnosis | `fetch_sensorless_drive(target_col=None)` | Multiclass (11) · med-hard | UCI #325 |
| Statlog (Shuttle) | `fetch_statlog_shuttle(target_col=None)` | Multiclass (7) · hard (imbalanced) | UCI #148 |

7.2.3. Ordinal

| Dataset | Fetcher (key params) | Task / classes | Source |
| --- | --- | --- | --- |
| Wine Quality | `fetch_wine_quality(target_col=None)` | Ordinal (quality 3–9) | UCI #186 |
| LeQua 2024 | `fetch_lequa2024(task='T1'\|'T2'\|'T3'\|'T4', include_test=False)` | T1 binary · T2 multiclass-28 · T3 ordinal · T4 binary (covariate) | Zenodo |

7.2.4. Text

| Dataset | Fetcher (key params) | Task / classes | Source |
| --- | --- | --- | --- |
| RCV1-v2 | `fetch_rcv1_v2()` | Multilabel topic (sparse) | via scikit-learn |
| 20 Newsgroups | `fetch_newsgroups20(subset='train'\|'test')` | Multiclass (20) | figshare |
| IMDB | `fetch_imdb(subset='train'\|'test')` | Binary (2) | Stanford |
| Multi-Domain Sentiment | `fetch_multidomain_sentiment(domain='books'\|'dvd'\|'electronics'\|'kitchen')` | Binary (2) · covariate / domain shift | JHU (Blitzer) |

7.2.5. Temporal / time-shift

| Dataset | Fetcher (key params) | Task / classes | Source |
| --- | --- | --- | --- |
| Sentiment140 | `fetch_sentiment140()` | Binary (2) · temporal | Stanford |
| Electricity (Elec2) | `fetch_electricity_elec2(target_col=None)` | Binary (2) · concept drift | scikit-multiflow |
| Airlines (flight delay) | `fetch_airlines(target_col=None)` | Binary (2) · concept drift | scikit-multiflow |
| Online News Popularity | `fetch_online_news_popularity(threshold=1400)` | Binary (popularity thresholded at `threshold`) | UCI #332 |

7.2.6. Image, graph & synthetic (other shift)

| Dataset | Fetcher (key params) | Task / classes | Source |
| --- | --- | --- | --- |
| SEA Concepts | `fetch_sea_concepts(n=50000, drift=True, noise=0.1)` | Binary (2) · concept drift (generated, no download) | synthetic |
| MNIST → USPS | `fetch_mnist_usps(domain='mnist'\|'usps', subset='train'\|'test'\|'all')` | Multiclass (10) · covariate shift | LIBSVM |
| CIFAR-10 | `fetch_cifar10(subset='train'\|'test'\|'all')` | Multiclass (10) · label / prior shift | Toronto |
| Planetoid (Cora / CiteSeer / PubMed) | `fetch_planetoid_cora_citeseer_pubmed(name='cora'\|'citeseer'\|'pubmed')` | Multiclass (3–7) · graph, covariate / structural shift | GitHub (Planetoid) |
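SEA Concepts is the one entry above that is generated rather than downloaded, so its parameters (`n`, `drift`, `noise`) are easiest to read next to a generator. The sketch below follows the usual description of the SEA stream: three uniform features on [0, 10], a label that depends on the sum of the first two, and a decision threshold that changes between four consecutive blocks. Whether `fetch_sea_concepts` uses exactly these thresholds and block boundaries is an assumption. `make_sea` is a hypothetical name, and the extra `block` return value exists only for illustration.

```python
import numpy as np

def make_sea(n=50000, drift=True, noise=0.1, random_state=None):
    """Hypothetical SEA-style generator (not the library's implementation)."""
    rng = np.random.default_rng(random_state)
    X = rng.uniform(0.0, 10.0, size=(n, 3))       # third feature is irrelevant
    thresholds = np.array([8.0, 9.0, 7.0, 9.5])   # one concept per block
    if drift:
        # Split the stream into four equal consecutive blocks.
        block = np.minimum(np.arange(n) * 4 // n, 3)
    else:
        block = np.zeros(n, dtype=int)            # a single concept throughout
    y = (X[:, 0] + X[:, 1] <= thresholds[block]).astype(int)
    flip = rng.random(n) < noise                  # label noise: flip a fraction
    y[flip] = 1 - y[flip]
    return X, y, block

X, y, block = make_sea(n=40000, drift=True, noise=0.0, random_state=0)
# The positive rate changes from block to block, which is the concept drift
# a quantifier would have to track.
rates = [y[block == b].mean() for b in range(4)]
```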

7.3. Optional download progress

The fetchers print nothing by default and pull in no progress dependency. To display download progress, register any callable with set_progress_hook; every fetcher then reports its downloads to it as hook(downloaded, total, url), where total is None when the server sends no Content-Length. The choice of progress tool (plain print, logging or tqdm) stays entirely yours:

from mlquantify.datasets import set_progress_hook
from tqdm import tqdm          # your dependency, not mlquantify's

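bars = {}                      # one progress bar per URL
def show(downloaded, total, url):
    bar = bars.get(url)
    if bar is None:            # create lazily; setdefault() would build a new tqdm on every call
        bar = bars[url] = tqdm(total=total, unit="B", unit_scale=True)
    bar.update(downloaded - bar.n)   # advance by the bytes received since the last call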

set_progress_hook(show)        # all fetchers now report progress
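For a setup without tqdm, a hook can be a few lines of plain Python. The sketch below is one possible shape, assuming only the hook(downloaded, total, url) call signature described above. `make_print_hook`, its `step` and `write` parameters, and the message format are all hypothetical.

```python
def make_print_hook(step=0.25, write=print):
    """Hypothetical dependency-free hook: report once per `step` fraction of each URL."""
    last_bucket = {}

    def hook(downloaded, total, url):
        if not total:                     # no Content-Length: only the byte count is known
            write(f"{url}: {downloaded} bytes")
            return
        fraction = downloaded / total
        bucket = int(fraction / step)
        if bucket > last_bucket.get(url, -1):
            last_bucket[url] = bucket     # remember the last reported step per URL
            write(f"{url}: {fraction:.0%}")

    return hook

# Demo: drive the hook by hand and collect its messages in a list.
lines = []
hook = make_print_hook(step=0.5, write=lines.append)
for done in (0, 250, 500, 750, 1000):
    hook(done, 1000, "https://example.invalid/toy.csv")
hook(2048, None, "https://example.invalid/other.bin")

# Registration would then follow the pattern shown above:
# set_progress_hook(make_print_hook())
```

Passing `write=logging.getLogger(__name__).info` instead of `print` would route the same messages through logging.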

7.4. References

The table below lists, for each dataset, the quantification works in the literature that use it (numbers refer to the reference list underneath).

| Dataset | Used by (quantification papers) |
| --- | --- |
| Mushroom | [1] [2] [3] |
| Banknote Authentication | (new addition) |
| Haberman’s Survival | [1] [4] [5] [6] |
| Pima Indians Diabetes | [1] [5] [6] [7] [8] |
| MiniBooNE Particle ID | [9] |
| Optical / Pen-Based Digits | [10] [11] |
| Dry Bean | (new addition) |
| Covertype | [2] [6] [10] [12] [13] |
| Yeast | [5] [6] [7] [14] |
| Sensorless Drive Diagnosis | (new addition) |
| Statlog (Shuttle) | [8] [9] |
| Wine Quality | [2] [6] [7] [11] [14] [15] |
| LeQua 2024 (T1–T4) | [16] [17] |
| RCV1-v2 | [18] [19] |
| 20 Newsgroups | [19] |
| IMDB | [19] [20] |
| Multi-Domain Sentiment | (new addition) |
| Sentiment140 | [21] |
| Electricity (Elec2) | (new addition) |
| Airlines | (new addition) |
| Online News Popularity | [21] |
| SEA Concepts | (new addition) |
| MNIST → USPS | [22] |
| CIFAR-10 | [22] [23] |
| Planetoid (Cora / CiteSeer / PubMed) | [24] |

Datasets marked "new addition" were added to the library recently and are not yet used by any quantification work in the surveyed bibliography; see the Source column of the tables in section 7.2 for each dataset’s original reference.