7. Real-World Datasets

Alongside the synthetic generator (Synthetic Datasets), mlquantify.datasets ships 25 dataset fetchers that download and cache well-known quantification benchmarks. They follow the same conventions as sklearn.datasets: every loader is a keyword-only fetch_<name>(...) function that returns a Bunch, or an (X, y) tuple when return_X_y=True, and caches the raw file under a local data_home so that it is downloaded only once.

What sets them apart from the scikit-learn loaders is the optional quantification protocol: pass protocol="app" (or another protocol) and the returned Bunch additionally carries .samples and .prevalences — a collection of test bags with known class prevalence, ready to score a quantifier against.

```python
from mlquantify.datasets import fetch_mushroom

b = fetch_mushroom()                       # plain load -> Bunch(data, target, ...)
X, y = fetch_mushroom(return_X_y=True)     # sklearn-style tuple

# quantification mode: 1000 bags of 500 instances drawn with the
# Artificial Prevalence Protocol
b = fetch_mushroom(protocol="app", n_samples=1000, sample_size=500, random_state=0)
b.samples        # list of index arrays into b.data, one per bag
b.prevalences    # (n_samples, n_classes) array of TRUE prevalences
```
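As a rough illustration of how the bags might be consumed, the sketch below scores a classify-and-count baseline against index arrays and true prevalences shaped like .samples and .prevalences. To stay runnable offline it builds stand-in arrays instead of calling a fetcher; the toy data, the threshold classifier and the helper name classify_and_count are assumptions made for this example, not part of mlquantify.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for b.data / b.target: one noisy feature, two classes.
n = 2000
n_classes = 2
y = rng.integers(0, n_classes, size=n)
X = (y + rng.normal(scale=0.8, size=n)).reshape(-1, 1)

# Stand-in for b.samples / b.prevalences: 50 bags of 200 indices each,
# with the true prevalence computed from the labels inside each bag.
samples = [rng.choice(n, size=200, replace=False) for _ in range(50)]
prevalences = np.array(
    [np.bincount(y[idx], minlength=n_classes) / len(idx) for idx in samples]
)

def classify_and_count(X_bag):
    """Toy quantifier: threshold the feature, then count the predictions."""
    pred = (X_bag[:, 0] > 0.5).astype(int)
    return np.bincount(pred, minlength=n_classes) / len(pred)

# Mean absolute error between estimated and true prevalence over all bags.
estimates = np.array([classify_and_count(X[idx]) for idx in samples])
mae = np.abs(estimates - prevalences).mean()
print(f"MAE over {len(samples)} bags: {mae:.3f}")
```

With a real fetcher the stand-ins would be replaced by b.data, b.samples and b.prevalences, and the toy function by whichever quantifier is being evaluated.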

7.1. Shared configuration

Every fetcher accepts the same base keyword arguments:

| Parameter | Meaning |
| --- | --- |
| data_home=None | Folder used to cache the downloaded file(s). Defaults to a _data/ directory next to the package. |
| download_if_missing=True | If False, raise instead of downloading when the cache is empty. |
| return_X_y=False | Return (X, y) instead of a Bunch. |
| as_frame=False | Return .data as a pandas.DataFrame and .target as a pandas.Series (with a combined .frame), where applicable. |
| n_retries=3 | Number of download attempts before giving up. |
| delay=1.0 | Seconds to wait between attempts. |
| protocol=None | Quantification sampling protocol: None (no bags), "app", "npp", "upp", "ppp", or an mlquantify protocol instance. When set, the Bunch also exposes .samples, .prevalences and .protocol. |
| n_samples=1000 | Number of prevalence points (bags) the protocol generates. |
| sample_size=500 | Instances per bag. |
| random_state=None | Seed forwarded to the protocol. |
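The way download_if_missing, n_retries and delay interact can be pictured roughly as follows. This is a sketch of the general cache-then-retry pattern and not mlquantify's actual code; the helper cached_fetch and its download callable are hypothetical, and n_retries is read as the total number of attempts, as described in the table above.

```python
import time
from pathlib import Path

def cached_fetch(path, download, *, download_if_missing=True, n_retries=3, delay=1.0):
    """Return `path`, calling `download(path)` only when the file is not cached."""
    path = Path(path)
    if path.exists():
        return path                       # cache hit: no network access
    if not download_if_missing:
        raise OSError(f"{path} is not cached and download_if_missing=False")
    attempts = max(1, n_retries)          # always try at least once
    for attempt in range(1, attempts + 1):
        try:
            download(path)                # hypothetical callable that writes the file
            return path
        except OSError:
            if attempt == attempts:
                raise                     # out of attempts: propagate the error
            time.sleep(delay)             # wait before the next attempt
```

Setting download_if_missing=False is mainly useful on machines without network access, where a missing cache should fail immediately instead of waiting through the retries.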

The protocol / n_samples / sample_size triple is the quantification-specific part of the API; see make_protocol and Synthetic Datasets for how protocols turn a labelled table into evaluation bags.
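To make the role of the three arguments concrete, here is a minimal sketch of the idea behind the Artificial Prevalence Protocol: draw a target prevalence vector, then sample indices class by class to match it. The function app_bags is hypothetical and written only for illustration; mlquantify's own protocol classes may differ in how they draw prevalences and round counts.

```python
import numpy as np

def app_bags(y, n_samples=1000, sample_size=500, random_state=None):
    """Sketch of APP: pick a target prevalence, then sample indices to match it."""
    rng = np.random.default_rng(random_state)
    y = np.asarray(y)
    classes = np.unique(y)
    by_class = [np.flatnonzero(y == c) for c in classes]
    samples, prevalences = [], []
    for _ in range(n_samples):
        # Uniform draw from the probability simplex (flat Dirichlet).
        target = rng.dirichlet(np.ones(len(classes)))
        # Convert to integer counts; the rounding remainder goes to the largest class.
        counts = np.floor(target * sample_size).astype(int)
        counts[np.argmax(target)] += sample_size - counts.sum()
        # Sample with replacement so small classes can still fill their quota.
        idx = np.concatenate(
            [rng.choice(pool, size=k, replace=True) for pool, k in zip(by_class, counts)]
        )
        samples.append(rng.permutation(idx))
        prevalences.append(counts / sample_size)   # realised (true) prevalence
    return samples, np.asarray(prevalences)

# Usage on toy labels: 5 bags of 100 instances each.
y_toy = np.array([0] * 30 + [1] * 50 + [2] * 20)
bags, prev = app_bags(y_toy, n_samples=5, sample_size=100, random_state=0)
```

The returned pair mirrors the shapes described for .samples (a list of index arrays) and .prevalences (an (n_samples, n_classes) array).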

7.2. Per-dataset configuration

Each dataset adds a few of its own keyword arguments on top of the shared ones (shown in the Fetcher (key params) column). target_col= lets you override the column used as the label for the tabular CSV loaders.

7.2.1. Binary (tabular)

| Dataset | Fetcher (key params) | Task / classes | Source |
| --- | --- | --- | --- |
| Mushroom | fetch_mushroom(target_col=None) | Binary (2) · easy | UCI #73 |
| Banknote Authentication | fetch_banknote_authentication(target_col=None) | Binary (2) · easy | UCI #267 |
| Haberman’s Survival | fetch_haberman_survival(target_col=None) | Binary (2) · hard | UCI #43 |
| Pima Indians Diabetes | fetch_pima_diabetes(target_col=None) | Binary (2) · hard | jbrownlee mirror |
| MiniBooNE Particle ID | fetch_miniboone() | Binary (2) · med-hard | UCI #199 |

7.2.2. Multiclass (tabular)

| Dataset | Fetcher (key params) | Task / classes | Source |
| --- | --- | --- | --- |
| Optical / Pen-Based Digits | fetch_digits_optical_penbased(which='optical' / 'penbased', target_col=None) | Multiclass (10) · easy | UCI #80 / #81 |
| Dry Bean | fetch_dry_bean(target_col=None) | Multiclass (7) · easy-med | UCI #602 |
| Covertype | fetch_covertype(target_col=None) | Multiclass (7) · hard | UCI #31 |
| Yeast | fetch_yeast(target_col=None) | Multiclass (10) · hard | UCI #110 |
| Sensorless Drive Diagnosis | fetch_sensorless_drive(target_col=None) | Multiclass (11) · med-hard | UCI #325 |
| Statlog (Shuttle) | fetch_statlog_shuttle(target_col=None) | Multiclass (7) · hard (imbalanced) | UCI #148 |

7.2.3. Ordinal

| Dataset | Fetcher (key params) | Task / classes | Source |
| --- | --- | --- | --- |
| Wine Quality | fetch_wine_quality(target_col=None) | Ordinal (quality 3–9) | UCI #186 |
| LeQua 2024 | fetch_lequa2024(task='T1' / 'T2' / 'T3' / 'T4', include_test=False) | T1 binary · T2 multiclass-28 · T3 ordinal · T4 binary (covariate) | Zenodo |

7.2.4. Text

| Dataset | Fetcher (key params) | Task / classes | Source |
| --- | --- | --- | --- |
| RCV1-v2 | fetch_rcv1_v2() | Multilabel topic (sparse) | via scikit-learn |
| 20 Newsgroups | fetch_newsgroups20(subset='train' / 'test') | Multiclass (20) | figshare |
| IMDB | fetch_imdb(subset='train' / 'test') | Binary (2) | Stanford |
| Multi-Domain Sentiment | fetch_multidomain_sentiment(domain='books' / 'dvd' / 'electronics' / 'kitchen') | Binary (2) · covariate / domain shift | JHU (Blitzer) |

7.2.5. Temporal / time-shift

| Dataset | Fetcher (key params) | Task / classes | Source |
| --- | --- | --- | --- |
| Sentiment140 | fetch_sentiment140() | Binary (2) · temporal | Stanford |
| Electricity (Elec2) | fetch_electricity_elec2(target_col=None) | Binary (2) · concept drift | scikit-multiflow |
| Airlines (flight delay) | fetch_airlines(target_col=None) | Binary (2) · concept drift | scikit-multiflow |
| Online News Popularity | fetch_online_news_popularity(threshold=1400) | Binary (popularity thresholded at threshold) | UCI #332 |
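For Online News Popularity, the threshold argument turns the raw share count into a binary label, so it also sets the base prevalence of the positive class. A minimal sketch, assuming an article counts as popular when it has at least threshold shares (whether the library compares with >= or > is not stated here); the helper binarize_shares and the share counts are made up for illustration.

```python
import numpy as np

def binarize_shares(shares, threshold=1400):
    """Label an article 1 (popular) if it has at least `threshold` shares."""
    return (np.asarray(shares) >= threshold).astype(int)

# Toy share counts, not taken from the real dataset.
shares = np.array([593, 1400, 2300, 711, 5000])
y = binarize_shares(shares)
print(y, y.mean())   # labels and the resulting positive-class prevalence
```

Raising the threshold lowers the positive-class prevalence of the full table, which is worth keeping in mind when comparing results obtained with different settings.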

7.2.6. Image, graph & synthetic (other shift)

| Dataset | Fetcher (key params) | Task / classes | Source |
| --- | --- | --- | --- |
| SEA Concepts | fetch_sea_concepts(n=50000, drift=True, noise=0.1) | Binary (2) · concept drift (generated, no download) | synthetic |
| MNIST → USPS | fetch_mnist_usps(domain='mnist' / 'usps', subset='train' / 'test' / 'all') | Multiclass (10) · covariate shift | LIBSVM |
| CIFAR-10 | fetch_cifar10(subset='train' / 'test' / 'all') | Multiclass (10) · label / prior shift | Toronto |
| Planetoid (Cora / CiteSeer / PubMed) | fetch_planetoid_cora_citeseer_pubmed(name='cora' / 'citeseer' / 'pubmed') | Multiclass (3–7) · graph, covariate / structural shift | GitHub (Planetoid) |
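SEA Concepts is the only entry that is generated rather than downloaded. The sketch below follows the usual definition of the benchmark (three uniform features in [0, 10], label decided by whether the first two sum to at most a threshold, four consecutive concepts with thresholds 8, 9, 7 and 9.5, plus label noise). The function sea_concepts is a hypothetical stand-in whose arguments mirror n, drift and noise; the library's generator may differ in details such as block layout or seeding.

```python
import numpy as np

def sea_concepts(n=50000, drift=True, noise=0.1, random_state=None):
    """Sketch of the SEA generator: sum-of-two-features rule with drifting threshold."""
    rng = np.random.default_rng(random_state)
    X = rng.uniform(0.0, 10.0, size=(n, 3))        # third feature is irrelevant
    thresholds = np.array([8.0, 9.0, 7.0, 9.5])    # one threshold per concept
    if drift:
        block = np.arange(n) * 4 // n              # four equal consecutive blocks
    else:
        block = np.zeros(n, dtype=int)             # single concept throughout
    y = (X[:, 0] + X[:, 1] <= thresholds[block]).astype(int)
    flip = rng.random(n) < noise                   # flip a fraction of the labels
    y[flip] = 1 - y[flip]
    return X, y

X_sea, y_sea = sea_concepts(n=4000, noise=0.0, random_state=0)
```

Because the threshold changes between blocks, the positive-class prevalence shifts along the stream, which is what makes the dataset useful for testing quantifiers under drift.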

7.3. Optional download progress

The fetchers print nothing by default and pull in no progress dependency. To display download progress, register any callable with set_progress_hook; every fetcher will then report its downloads to it as hook(downloaded, total, url) (total is None when the server sends no Content-Length). This keeps the choice of progress tool — plain print, logging or tqdm — entirely yours:

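```python
from mlquantify.datasets import set_progress_hook
from tqdm import tqdm          # your dependency, not mlquantify's

bars = {}

def show(downloaded, total, url):
    # Create one bar per URL the first time it is seen. dict.setdefault would
    # build a throwaway tqdm on every call, since its argument is evaluated eagerly.
    if url not in bars:
        bars[url] = tqdm(total=total, unit="B", unit_scale=True)
    bar = bars[url]
    bar.update(downloaded - bar.n)   # advance by the bytes received since the last call

set_progress_hook(show)        # all fetchers now report progress
```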
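If tqdm is not wanted, a hook built on the standard library works just as well. The sketch below formats one line per call and sends it to logging; it relies only on the hook(downloaded, total, url) signature described above, while the names format_progress, log_progress and the logger name are assumptions made for this example.

```python
import logging

logger = logging.getLogger("mlquantify.downloads")   # hypothetical logger name

def format_progress(downloaded, total, url):
    """Render one progress line; `total` is None when the size is unknown."""
    if total:
        return f"{url}: {downloaded / total:.1%} ({downloaded}/{total} bytes)"
    return f"{url}: {downloaded} bytes (size unknown)"

def log_progress(downloaded, total, url):
    """Progress hook that reports through the logging module."""
    logger.info(format_progress(downloaded, total, url))

# To activate it: set_progress_hook(log_progress)
```

Keeping the formatting in its own function makes the hook easy to check without performing a download, and the same function can back a plain print-based hook.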

7.4. References

The table below lists, for each dataset, the quantification works in the literature that use it (numbers refer to the reference list underneath).

| Dataset | Used by (quantification papers) |
| --- | --- |
| Mushroom | [1] [2] [3] |
| Banknote Authentication | none (new addition) |
| Haberman’s Survival | [1] [4] [5] [6] |
| Pima Indians Diabetes | [1] [5] [6] [7] [8] |
| MiniBooNE Particle ID | [9] |
| Optical / Pen-Based Digits | [10] [11] |
| Dry Bean | none (new addition) |
| Covertype | [2] [6] [10] [12] [13] |
| Yeast | [5] [6] [7] [14] |
| Sensorless Drive Diagnosis | none (new addition) |
| Statlog (Shuttle) | [8] [9] |
| Wine Quality | [2] [6] [7] [11] [14] [15] |
| LeQua 2024 (T1–T4) | [16] [17] |
| RCV1-v2 | [18] [19] |
| 20 Newsgroups | [19] |
| IMDB | [19] [20] |
| Multi-Domain Sentiment | none (new addition) |
| Sentiment140 | [21] |
| Electricity (Elec2) | none (new addition) |
| Airlines | none (new addition) |
| Online News Popularity | [21] |
| SEA Concepts | none (new addition) |
| MNIST → USPS | [22] |
| CIFAR-10 | [22] [23] |
| Planetoid (Cora / CiteSeer / PubMed) | [24] |

Datasets marked new addition are recent additions to the library that are not yet used by a quantification work in the surveyed bibliography; see the Source column above for each dataset’s original reference.

See also