mlquantify.datasets
Synthetic and helper datasets for quantification.
Mirrors the spirit of sklearn.datasets, but the generators produce
bags (samples with controlled class prevalence) suitable for evaluating
quantifiers under distribution shift.
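To make the idea of a bag with controlled class prevalence concrete, here is a minimal sketch using only NumPy. The function name `sample_bag` and its parameters are hypothetical, chosen for illustration. They are not part of the mlquantify API, and the library's own generators may work differently. The sketch fixes the per-class counts from a target prevalence vector and then draws instances within each class, so the class-conditional distribution stays the same while the class prior changes (prior-probability shift).

```python
import numpy as np


def sample_bag(X, y, prevalence, bag_size, rng=None):
    """Draw one bag whose class proportions follow `prevalence`.

    Hypothetical helper for illustration; not the mlquantify API.
    `prevalence` is ordered like np.unique(y) and must sum to 1.
    """
    rng = np.random.default_rng(rng)
    classes = np.unique(y)
    prevalence = np.asarray(prevalence, dtype=float)
    if prevalence.shape != classes.shape or not np.isclose(prevalence.sum(), 1.0):
        raise ValueError("prevalence must have one entry per class and sum to 1")

    # Turn target proportions into integer counts that add up to bag_size:
    # floor first, then give leftover slots to the largest fractional parts.
    exact = prevalence * bag_size
    counts = np.floor(exact).astype(int)
    remainder = int(bag_size - counts.sum())
    if remainder > 0:
        counts[np.argsort(-(exact - counts))[:remainder]] += 1

    # Sample within each class; fall back to sampling with replacement
    # only when the class pool is smaller than the requested count.
    picked = []
    for cls, n in zip(classes, counts):
        pool = np.flatnonzero(y == cls)
        picked.append(rng.choice(pool, size=n, replace=n > pool.size))
    idx = rng.permutation(np.concatenate(picked))
    return X[idx], y[idx], counts / bag_size


# Toy data: 1000 instances, roughly balanced binary labels.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(1000, 4))
y = data_rng.integers(0, 2, size=1000)

# One bag of 50 instances with 20% class 0 and 80% class 1.
X_bag, y_bag, realized = sample_bag(X, y, [0.2, 0.8], bag_size=50, rng=1)
```

An evaluation protocol would typically repeat this for many prevalence vectors, for example ones drawn from a flat Dirichlet distribution, and compare a quantifier's estimate against `realized` for each bag.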
Generate synthetic quantification bags under prior-probability shift.
Mushroom: edible vs.
Banknote authentication from wavelet image features (binary).
Haberman survival after breast-cancer surgery (binary, hard).
MiniBooNE particle identification: signal vs.
Optical / Pen-based handwritten digits (10-class, easy).
Dry Bean: seven bean varieties from grain morphology (multiclass).
Forest Covertype: 7 cover types from cartographic variables (multiclass, large).
Yeast protein localization site (10-class, hard, imbalanced).
Sensorless drive diagnosis from motor current signals (11-class, balanced).
Statlog (Shuttle): space-shuttle radiator states (multiclass, extreme imbalance).
Wine Quality: sensory score 3-9 from physicochemistry (ORDINAL).
Online News Popularity: will an article be popular? (binary, temporal).
Pima Indians Diabetes (binary, hard, noisy medical).
Electricity (Elec2): NSW market price up/down stream (binary, drift).
Airlines: flight-delay stream (binary, large, temporal).
20 Newsgroups: Usenet posts in 20 topics (text, multiclass).
IMDB Large Movie Review sentiment (text, binary, balanced).
Multi-Domain (Blitzer) Amazon review sentiment (text, covariate shift).
Sentiment140: 1.6M timestamped tweets (text, binary, temporal).
RCV1-v2: Reuters news topics (text, sparse TF-IDF, multilabel).
MNIST -> USPS handwritten digits (image, covariate shift).
CIFAR-10 natural images (image, 10-class, balanced).
Planetoid citation graphs: Cora / CiteSeer / PubMed (graph nodes).
SEA Concepts: synthetic stream with abrupt concept drift (binary).
LeQua 2024 competition vectors, all tasks via
urllib download with local cache + retries (like sklearn). |
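The last entry describes a download helper with a local cache and retries. A rough sketch of that pattern with the standard library is shown below. The name `fetch_cached`, its parameters, and the `.part` temporary-file convention are assumptions made for illustration; the actual mlquantify helper is not shown in this listing and may differ.

```python
import shutil
import time
import urllib.request
from pathlib import Path


def fetch_cached(url, filename, cache_dir, n_retries=3, delay=1.0):
    """Download `url` to `cache_dir/filename` unless it is already cached.

    Hypothetical helper for illustration; not the mlquantify API.
    Returns the path of the cached file.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / filename
    if target.exists():
        # Cache hit: skip the network entirely.
        return target

    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file under the final name.
    tmp = target.with_name(target.name + ".part")
    for attempt in range(n_retries + 1):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp, open(tmp, "wb") as out:
                shutil.copyfileobj(resp, out)
            tmp.replace(target)
            return target
        except OSError:
            # URLError and socket timeouts are OSError subclasses.
            tmp.unlink(missing_ok=True)
            if attempt == n_retries:
                raise
            time.sleep(delay)
```

A second call with the same `filename` and `cache_dir` returns the cached path without touching the network. Deleting the cached file forces a fresh download.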