mlquantify.datasets

Synthetic and helper datasets for quantification.

Mirrors the spirit of sklearn.datasets, but the generators produce bags (samples with controlled class prevalence) suitable for evaluating quantifiers under distribution shift.
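
To make the idea of a bag with controlled class prevalence concrete, here is a minimal NumPy sketch of sampling one bag under prior-probability shift. It does not use the mlquantify API: `sample_bag` and its parameters are hypothetical names chosen for illustration, and the toy pool is generated on the spot.

```python
import numpy as np


def sample_bag(X, y, prevalence, bag_size, rng):
    """Hypothetical helper: draw one bag whose class proportions follow `prevalence`."""
    classes = np.unique(y)
    # Turn the target prevalence into integer per-class counts.
    counts = np.floor(np.asarray(prevalence) * bag_size).astype(int)
    # Give any rounding remainder to the most prevalent class so counts sum to bag_size.
    counts[np.argmax(prevalence)] += bag_size - counts.sum()
    # Sample within each class, so P(x | y) is unchanged and only P(y) shifts.
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n, replace=True)
        for c, n in zip(classes, counts)
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]


rng = np.random.default_rng(0)

# Toy labelled pool (an assumption for this sketch): two Gaussian blobs.
X = rng.normal(size=(1000, 2))
y = rng.integers(0, 2, size=1000)
X[y == 1] += 2.0

# One bag with 90% of class 0 and 10% of class 1.
Xb, yb = sample_bag(X, y, prevalence=[0.9, 0.1], bag_size=200, rng=rng)
bag_prevalence = np.bincount(yb, minlength=2) / len(yb)
print(bag_prevalence)
```

A quantifier is then evaluated by comparing its estimated prevalence for each bag against the prevalence the bag was drawn with, across many bags whose prevalences differ from the training set.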

make_quantification

Generate synthetic quantification bags under prior-probability shift.

fetch_mushroom

Mushroom: edible vs. poisonous.

fetch_banknote_authentication

Banknote authentication from wavelet image features (binary).

fetch_haberman_survival

Haberman survival after breast-cancer surgery (binary, hard).

fetch_miniboone

MiniBooNE particle identification: signal vs. background.

fetch_digits_optical_penbased

Optical / Pen-based handwritten digits (10-class, easy).

fetch_dry_bean

Dry Bean: seven bean varieties from grain morphology (multiclass).

fetch_covertype

Forest Covertype: 7 cover types from cartographic variables (multiclass, large).

fetch_yeast

Yeast protein localization site (10-class, hard, imbalanced).

fetch_sensorless_drive

Sensorless drive diagnosis from motor current signals (11-class, balanced).

fetch_statlog_shuttle

Statlog (Shuttle): space-shuttle radiator states (multiclass, extreme imbalance).

fetch_wine_quality

Wine Quality: sensory score 3-9 from physicochemistry (ordinal).

fetch_online_news_popularity

Online News Popularity: will an article be popular? (binary, temporal).

fetch_pima_diabetes

Pima Indians Diabetes (binary, hard, noisy medical).

fetch_electricity_elec2

Electricity (Elec2): NSW market price up/down stream (binary, drift).

fetch_airlines

Airlines: flight-delay stream (binary, large, temporal).

fetch_newsgroups20

20 Newsgroups: Usenet posts in 20 topics (text, multiclass).

fetch_imdb

IMDB Large Movie Review sentiment (text, binary, balanced).

fetch_multidomain_sentiment

Multi-Domain (Blitzer) Amazon review sentiment (text, covariate shift).

fetch_sentiment140

Sentiment140: 1.6M timestamped tweets (text, binary, temporal).

fetch_rcv1_v2

RCV1-v2: Reuters news topics (text, sparse TF-IDF, multilabel).

fetch_mnist_usps

MNIST -> USPS handwritten digits (image, covariate shift).

fetch_cifar10

CIFAR-10 natural images (image, 10-class, balanced).

fetch_planetoid_cora_citeseer_pubmed

Planetoid citation graphs: Cora / CiteSeer / PubMed (graph nodes).

fetch_sea_concepts

SEA Concepts: synthetic stream with abrupt concept drift (binary).

fetch_lequa2024

LeQua 2024 competition vectors; all tasks are selectable via the task argument (text/ordinal).

Bunch

get_data_home

fetch_remote

Download a file with urllib, using a local cache and retries (similar to scikit-learn).
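
The cache-then-retry pattern described for fetch_remote can be sketched with the standard library alone. This is not the mlquantify implementation: `download_with_cache` and all of its parameters are hypothetical, and the real function's signature may differ.

```python
import time
import urllib.request
from pathlib import Path


def download_with_cache(url, cache_dir, filename, n_retries=3, delay=1.0):
    """Hypothetical helper: fetch `url` once, then reuse the cached copy."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / filename

    # Cache hit: skip the network entirely.
    if target.exists():
        return target

    last_error = None
    for attempt in range(n_retries + 1):
        try:
            with urllib.request.urlopen(url) as response:
                data = response.read()
            # Write to a temporary name first so an interrupted download
            # never leaves a partial file under the final name.
            tmp = target.with_name(target.name + ".part")
            tmp.write_bytes(data)
            tmp.replace(target)
            return target
        except OSError as exc:  # urllib.error.URLError subclasses OSError
            last_error = exc
            if attempt < n_retries:
                time.sleep(delay)
    raise last_error
```

Writing to a `.part` file and renaming it at the end is a design choice of this sketch: it keeps the existence check at the top reliable, since a file only appears under its final name after a complete download.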