.. _datasets:
.. _real_world_datasets:

.. currentmodule:: mlquantify.datasets

====================
Real-World Datasets
====================

Alongside the synthetic generator (:ref:`synthetic_datasets`),
``mlquantify.datasets`` ships **25 dataset fetchers** that download and cache
well-known quantification benchmarks. They follow the same conventions as
:mod:`sklearn.datasets`: every loader is a keyword-only ``fetch_<name>(...)``
function that returns a :class:`Bunch` (or a plain ``(X, y)`` tuple) and caches
the raw file under a local ``data_home``, so it is downloaded only once.

What sets them apart from the scikit-learn loaders is the optional
**quantification protocol**: pass ``protocol="app"`` (or another protocol) and
the returned :class:`Bunch` additionally carries ``.samples`` and
``.prevalences``: a collection of test *bags* with known class prevalences,
against which a quantifier can be scored.

.. code-block:: python

   from mlquantify.datasets import fetch_mushroom

   b = fetch_mushroom()                      # plain load -> Bunch(data, target, ...)
   X, y = fetch_mushroom(return_X_y=True)    # sklearn-style tuple

   # quantification mode: 1000 bags of 500 instances drawn with the
   # Artificial Prevalence Protocol
   b = fetch_mushroom(protocol="app", n_samples=1000, sample_size=500,
                      random_state=0)
   b.samples        # list of index arrays into b.data, one per bag
   b.prevalences    # (n_samples, n_classes) array of TRUE prevalences

Shared configuration
====================================================================

Every fetcher accepts the same base keyword arguments:

.. list-table::
   :header-rows: 1
   :widths: 24 76

   * - Parameter
     - Meaning
   * - ``data_home=None``
     - Folder used to cache the downloaded file(s). Defaults to a ``_data/``
       directory next to the package.
   * - ``download_if_missing=True``
     - If ``False``, raise instead of downloading when the cache is empty.
   * - ``return_X_y=False``
     - Return ``(X, y)`` instead of a :class:`Bunch`.
   * - ``as_frame=False``
     - Return ``.data`` as a :class:`pandas.DataFrame` and ``.target`` as a
       :class:`pandas.Series` (with a combined ``.frame``), where applicable.
   * - ``n_retries=3``
     - Number of download attempts before giving up.
   * - ``delay=1.0``
     - Seconds to wait between attempts.
   * - ``protocol=None``
     - Quantification sampling protocol: ``None`` (no bags), ``"app"``,
       ``"npp"``, ``"upp"``, ``"ppp"``, or an mlquantify protocol instance.
       When set, the :class:`Bunch` also exposes ``.samples``,
       ``.prevalences`` and ``.protocol``.
   * - ``n_samples=1000``
     - Number of prevalence points (bags) the protocol generates.
   * - ``sample_size=500``
     - Instances per bag.
   * - ``random_state=None``
     - Seed forwarded to the protocol.

The ``protocol`` / ``n_samples`` / ``sample_size`` triple is the
quantification-specific part of the API; see :func:`make_protocol` and
:ref:`synthetic_datasets` for how protocols turn a labelled table into
evaluation bags.

Per-dataset configuration
====================================================================

Each dataset adds a few keyword arguments of its own on top of the shared ones
(shown in the *Fetcher (key params)* column). For the tabular CSV loaders,
``target_col=`` overrides the column used as the label.

Binary (tabular)
--------------------------------------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 26 36 20 18

   * - Dataset
     - Fetcher (key params)
     - Task / classes
     - Source
   * - Mushroom
     - ``fetch_mushroom(target_col=None)``
     - Binary (2) · easy
     - UCI #73
   * - Banknote Authentication
     - ``fetch_banknote_authentication(target_col=None)``
     - Binary (2) · easy
     - UCI #267
   * - Haberman's Survival
     - ``fetch_haberman_survival(target_col=None)``
     - Binary (2) · hard
     - UCI #43
   * - Pima Indians Diabetes
     - ``fetch_pima_diabetes(target_col=None)``
     - Binary (2) · hard
     - jbrownlee mirror
   * - MiniBooNE Particle ID
     - ``fetch_miniboone()``
     - Binary (2) · med-hard
     - UCI #199

Multiclass (tabular)
--------------------------------------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 26 36 20 18

   * - Dataset
     - Fetcher (key params)
     - Task / classes
     - Source
   * - Optical / Pen-Based Digits
     - ``fetch_digits_optical_penbased(which='optical'|'penbased', target_col=None)``
     - Multiclass (10) · easy
     - UCI #80 / #81
   * - Dry Bean
     - ``fetch_dry_bean(target_col=None)``
     - Multiclass (7) · easy-med
     - UCI #602
   * - Covertype
     - ``fetch_covertype(target_col=None)``
     - Multiclass (7) · hard
     - UCI #31
   * - Yeast
     - ``fetch_yeast(target_col=None)``
     - Multiclass (10) · hard
     - UCI #110
   * - Sensorless Drive Diagnosis
     - ``fetch_sensorless_drive(target_col=None)``
     - Multiclass (11) · med-hard
     - UCI #325
   * - Statlog (Shuttle)
     - ``fetch_statlog_shuttle(target_col=None)``
     - Multiclass (7) · hard (imbalanced)
     - UCI #148

Ordinal
--------------------------------------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 26 36 20 18

   * - Dataset
     - Fetcher (key params)
     - Task / classes
     - Source
   * - Wine Quality
     - ``fetch_wine_quality(target_col=None)``
     - Ordinal (quality 3–9)
     - UCI #186
   * - LeQua 2024
     - ``fetch_lequa2024(task='T1'|'T2'|'T3'|'T4', include_test=False)``
     - T1 binary · T2 multiclass-28 · T3 ordinal · T4 binary (covariate)
     - Zenodo

Text
--------------------------------------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 26 36 20 18

   * - Dataset
     - Fetcher (key params)
     - Task / classes
     - Source
   * - RCV1-v2
     - ``fetch_rcv1_v2()``
     - Multilabel topic (sparse)
     - via scikit-learn
   * - 20 Newsgroups
     - ``fetch_newsgroups20(subset='train'|'test')``
     - Multiclass (20)
     - figshare
   * - IMDB
     - ``fetch_imdb(subset='train'|'test')``
     - Binary (2)
     - Stanford
   * - Multi-Domain Sentiment
     - ``fetch_multidomain_sentiment(domain='books'|'dvd'|'electronics'|'kitchen')``
     - Binary (2) · covariate / domain shift
     - JHU (Blitzer)

Temporal / time-shift
--------------------------------------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 26 36 20 18

   * - Dataset
     - Fetcher (key params)
     - Task / classes
     - Source
   * - Sentiment140
     - ``fetch_sentiment140()``
     - Binary (2) · temporal
     - Stanford
   * - Electricity (Elec2)
     - ``fetch_electricity_elec2(target_col=None)``
     - Binary (2) · concept drift
     - scikit-multiflow
   * - Airlines (flight delay)
     - ``fetch_airlines(target_col=None)``
     - Binary (2) · concept drift
     - scikit-multiflow
   * - Online News Popularity
     - ``fetch_online_news_popularity(threshold=1400)``
     - Binary (popularity thresholded at ``threshold``)
     - UCI #332

Image, graph & synthetic (other shift)
--------------------------------------------------------------------

.. list-table::
   :header-rows: 1
   :widths: 26 36 20 18

   * - Dataset
     - Fetcher (key params)
     - Task / classes
     - Source
   * - SEA Concepts
     - ``fetch_sea_concepts(n=50000, drift=True, noise=0.1)``
     - Binary (2) · concept drift (generated, no download)
     - synthetic
   * - MNIST → USPS
     - ``fetch_mnist_usps(domain='mnist'|'usps', subset='train'|'test'|'all')``
     - Multiclass (10) · covariate shift
     - LIBSVM
   * - CIFAR-10
     - ``fetch_cifar10(subset='train'|'test'|'all')``
     - Multiclass (10) · label / prior shift
     - Toronto
   * - Planetoid (Cora / CiteSeer / PubMed)
     - ``fetch_planetoid_cora_citeseer_pubmed(name='cora'|'citeseer'|'pubmed')``
     - Multiclass (3–7) · graph, covariate / structural shift
     - GitHub (Planetoid)

Optional download progress
====================================================================

The fetchers print nothing by default and pull in **no** progress dependency.
To display download progress, register any callable with
:func:`set_progress_hook`; every fetcher will then report its downloads to it
as ``hook(downloaded, total, url)`` (``total`` is ``None`` when the server
sends no ``Content-Length``). This leaves the choice of progress tool (plain
``print``, ``logging`` or ``tqdm``) entirely to you:

.. code-block:: python

   from mlquantify.datasets import set_progress_hook
   from tqdm import tqdm          # your dependency, not mlquantify's

   bars = {}                      # one progress bar per URL

   def show(downloaded, total, url):
       # Create each bar once; setdefault() would build a new tqdm on every call.
       if url not in bars:
           bars[url] = tqdm(total=total, unit="B", unit_scale=True)
       bar = bars[url]
       bar.update(downloaded - bar.n)   # advance by the newly received bytes

   set_progress_hook(show)        # all fetchers now report progress

References
====================================================================

The table below lists, for each dataset, the quantification works in the
literature that use it (numbers refer to the reference list underneath).

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Dataset
     - Used by (quantification papers)
   * - Mushroom
     - [1]_ [2]_ [3]_
   * - Banknote Authentication
     - None (new addition)
   * - Haberman's Survival
     - [1]_ [4]_ [5]_ [6]_
   * - Pima Indians Diabetes
     - [1]_ [5]_ [6]_ [7]_ [8]_
   * - MiniBooNE Particle ID
     - [9]_
   * - Optical / Pen-Based Digits
     - [10]_ [11]_
   * - Dry Bean
     - None (new addition)
   * - Covertype
     - [2]_ [6]_ [10]_ [12]_ [13]_
   * - Yeast
     - [5]_ [6]_ [7]_ [14]_
   * - Sensorless Drive Diagnosis
     - None (new addition)
   * - Statlog (Shuttle)
     - [8]_ [9]_
   * - Wine Quality
     - [2]_ [6]_ [7]_ [11]_ [14]_ [15]_
   * - LeQua 2024 (T1–T4)
     - [16]_ [17]_
   * - RCV1-v2
     - [18]_ [19]_
   * - 20 Newsgroups
     - [19]_
   * - IMDB
     - [19]_ [20]_
   * - Multi-Domain Sentiment
     - None (new addition)
   * - Sentiment140
     - [21]_
   * - Electricity (Elec2)
     - None (new addition)
   * - Airlines
     - None (new addition)
   * - Online News Popularity
     - [21]_
   * - SEA Concepts
     - None (new addition)
   * - MNIST → USPS
     - [22]_
   * - CIFAR-10
     - [22]_ [23]_
   * - Planetoid (Cora / CiteSeer / PubMed)
     - [24]_

Datasets marked *new addition* are recent additions to the library that are
not yet used by a quantification work in the surveyed bibliography; see the
*Source* column above for each dataset's original reference.

.. [1] Bella, A., Ferri, C., Hernández-Orallo, J., Ramírez-Quintana, M.J.
   (2010). Quantification via Probability Estimators. *IEEE ICDM 2010*,
   737–742.
.. [2] Maletzke, A., dos Reis, D., Cherman, E., Batista, G. (2019). DyS: A
   Framework for Mixture Models in Quantification. *AAAI 2019*, 4552–4560.
.. [3] Maletzke, A. (2019). Binary Quantification in Non-Stationary Scenarios.
   PhD thesis, ICMC, Universidade de São Paulo.
.. [4] Barranquero, J., González, P., Díez, J., del Coz, J.J. (2013). On the
   study of nearest neighbor algorithms for prevalence estimation in binary
   problems. *Pattern Recognition* 46(2), 472–482.
.. [5] Pérez-Gállego, P., Castaño, A., Quevedo, J.R., del Coz, J.J. (2019).
   Dynamic ensemble selection for quantification tasks. *Information Fusion*
   45, 1–15.
.. [6] Bunse, M., et al. (eds.) (2024; 2025). Proceedings of the 4th / 5th
   International Workshop on Learning to Quantify (LQ 2024 / LQ 2025).
.. [7] González-Castro, V., Alaiz-Rodríguez, R., Alegre, E. (2013). Class
   distribution estimation based on the Hellinger distance. *Information
   Sciences* 218, 146–164.
.. [8] Iyer, A., Nath, S., Sarawagi, S. (2014). Maximum Mean Discrepancy for
   Class Ratio Estimation: Convergence Bounds and Kernel Selection.
   *ICML 2014*.
.. [9] Schumacher, T., Strohmaier, M., Lemmerich, F. (2023). A Comparative
   Evaluation of Quantification Methods. arXiv preprint.
.. [10] Joachims, T. (2005). A Support Vector Method for Multivariate
   Performance Measures. *ICML 2005*, 377–384.
.. [11] dos Reis, D., Maletzke, A., Cherman, E., Batista, G. (2018). One-class
   Quantification. *ECML-PKDD 2018*, 273–289.
.. [12] Maletzke, A., dos Reis, D., Hassan, W., Batista, G. (2021). Accurately
   Quantifying under Score Variability (MoSS/DySyn). *IEEE ICDM 2021*.
.. [13] Xue, J.C., Weiss, G.M. (2009). Quantification and semi-supervised
   classification methods for handling changes in class distribution.
   *ACM SIGKDD 2009*, 897–906.
.. [14] Moreo, A., Salvati, M. (2025). An Efficient Method for Deriving
   Confidence Intervals in Aggregative Quantification.
.. [15] Hassan, W., Maletzke, A., Batista, G. (2020). Accurately Quantifying a
   Billion Instances per Second. *IEEE DSAA 2020*.
.. [16] Pérez-Mon, O., Moreo, A., del Coz, J.J., González, P. (2024).
   Quantification using permutation-invariant networks based on histograms
   (HistNetQ). *Machine Learning*.
.. [17] Pérez-Mon, O., del Coz, J.J., González, P. (2025). Quantification via
   Gaussian Latent Space Representations (GMNet).
.. [18] Hassan, W., Maletzke, A., Batista, G. (2021). Pitfalls in
   Quantification Assessment.
.. [19] Esuli, A., Fabris, A., Moreo, A., Sebastiani, F. (2023). Learning to
   Quantify. Springer, The Information Retrieval Series.
.. [20] Esuli, A., Moreo, A., Sebastiani, F. (2018). A Recurrent Neural
   Network for Sentiment Quantification (QuaNet). *ACM CIKM 2018*, 1775–1778.
.. [21] Li, et al. (2025). Quantification over Time. arXiv preprint.
.. [22] Alexandari, A., Kundaje, A., Shrikumar, A. (2020). Maximum Likelihood
   with Bias-Corrected Calibration is Hard-To-Beat at Label Shift Adaptation.
   *ICML 2020*.
.. [23] Firat, A. (2016). Unified Framework for Quantification.
   arXiv:1606.00868.
.. [24] Damke, C., Hüllermeier, E. (2025). Adjusted Count Quantification
   Learning on Graphs. Preprint.

.. seealso::

   - :ref:`synthetic_datasets`: the :func:`make_quantification` generator for
     fully controlled prior / covariate / concept shift.
   - :mod:`mlquantify.datasets`: the full API reference for every fetcher.
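
To make the bag structure from the introduction concrete, the sketch below shows how a list of index arrays (the shape documented for ``.samples``) relates to per-bag class prevalences (the shape documented for ``.prevalences``). It is a dependency-free illustration on toy labels, not mlquantify code: ``bag_prevalence`` is a hypothetical helper, and the toy ``labels`` and ``bags`` only stand in for ``b.target`` and ``b.samples``.

```python
from collections import Counter


def bag_prevalence(labels, bag, classes):
    """Return the class prevalence vector of one bag (a list of row indices)."""
    counts = Counter(labels[i] for i in bag)
    return [counts[c] / len(bag) for c in classes]


# Toy stand-ins (assumed shapes, not real data):
# labels plays the role of b.target, bags the role of b.samples.
labels = [0, 0, 0, 1, 1, 0, 1, 0]
bags = [[0, 1, 3, 4], [2, 5, 7, 6]]
classes = sorted(set(labels))

# One prevalence vector per bag, ordered like `classes`.
prevalences = [bag_prevalence(labels, bag, classes) for bag in bags]
print(prevalences)
```

With a real fetcher the prevalences are already provided, so a helper like this would only be needed to sanity-check them or to build bags by hand.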