fetch_imdb

mlquantify.datasets.fetch_imdb(*, data_home=None, download_if_missing=True, return_X_y=False, as_frame=False, n_retries=3, delay=1.0, protocol=None, n_samples=1000, sample_size=500, random_state=None, subset='train')

IMDB Large Movie Review sentiment (text, binary, balanced).

50,000 polar movie reviews, split evenly into 25,000 train and 25,000 test documents, with each split balanced between positive and negative. The documents are long free text and are returned as raw text with 0/1 labels.

Quantification: the binary sentiment workhorse for text quantification.

Documents	25000 per split
Features	raw text
Classes	2 (balanced)

Source: https://ai.stanford.edu/~amaas/data/sentiment/

Parameters:

data_home : str or path-like, default=None
    Folder used to cache the downloaded file(s); defaults to _data/ next to the package.
download_if_missing : bool, default=True
    If False, raise instead of downloading when the cache is empty.
return_X_y : bool, default=False
    Return (X, y) instead of a Bunch.
as_frame : bool, default=False
    Return .data as a DataFrame, .target as a Series, and a combined .frame (features + a "target" column).
n_retries : int, default=3
    Number of download attempts before giving up.
delay : float, default=1.0
    Seconds to wait between attempts.
protocol : {None, "app", "npp", "upp", "ppp"} or mlquantify protocol, default=None
    If set, draw evaluation sample-bags with an mlquantify protocol; the Bunch then also has .samples (index bags into .data), .prevalences and .protocol.
n_samples : int, default=1000
    Number of prevalence points (bags) generated by the protocol.
sample_size : int, default=500
    Instances per bag (the protocol batch_size).
random_state : int or None, default=None
    Seed forwarded to the protocol.
subset : {'train', 'test'}, default='train'
    Which split to load.
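The protocol, n_samples and sample_size parameters describe bags of indices drawn at controlled class prevalences. The following is a minimal standard-library sketch of that idea, not mlquantify's implementation: draw_bag is a hypothetical helper, the labels are synthetic, and the three target prevalences are chosen only for illustration.

```python
import random


def draw_bag(labels, prevalence, sample_size, rng):
    """Return sample_size indices with round(prevalence * sample_size) positives.

    Hypothetical helper for illustration; not part of mlquantify.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = round(prevalence * sample_size)
    # Sample without replacement from each class, then mix the indices.
    bag = rng.sample(pos, n_pos) + rng.sample(neg, sample_size - n_pos)
    rng.shuffle(bag)
    return bag


rng = random.Random(0)                 # plays the role of random_state
labels = [0, 1] * 1000                 # synthetic, balanced stand-in for .target
prevalences = [0.1, 0.5, 0.9]          # illustrative positive-class prevalences
bags = [draw_bag(labels, p, 500, rng) for p in prevalences]

for p, bag in zip(prevalences, bags):
    observed = sum(labels[i] for i in bag) / len(bag)
    print(p, observed)
```

Each bag stores indices rather than documents, so the same cached texts can be reused across many bags without copying them.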

Returns:

data : Bunch
    Dictionary-like object. Attributes: data (features), target (labels), feature_names, target_names, DESCR; frame when as_frame=True; and samples / prevalences / protocol when protocol is set.
(X, y) : tuple
    Returned instead when return_X_y=True.
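As a rough illustration of how the samples attribute relates to data and target, the sketch below uses a hand-built stand-in object rather than a real fetch_imdb result. The four toy documents are invented, and the layout of prevalences (one positive-class fraction per bag) is an assumption; the real attribute may be structured differently.

```python
from types import SimpleNamespace

# Hypothetical stand-in for the returned Bunch; real values come from fetch_imdb.
bunch = SimpleNamespace(
    data=["great film", "dull plot", "loved it", "awful acting"],
    target=[1, 0, 1, 0],
    samples=[[0, 2, 3], [1, 3, 0]],   # index bags into .data
    prevalences=[2 / 3, 1 / 3],       # assumed: positive-class fraction per bag
)

for bag, expected in zip(bunch.samples, bunch.prevalences):
    texts = [bunch.data[i] for i in bag]       # documents in this bag
    labels = [bunch.target[i] for i in bag]    # their 0/1 labels
    observed = sum(labels) / len(labels)       # positive fraction in the bag
    print(len(texts), round(observed, 3), round(expected, 3))
```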

References

Maas, A. et al. (2011). Learning word vectors for sentiment analysis. ACL 2011.

Examples

>>> from mlquantify.datasets import fetch_imdb
>>> b = fetch_imdb(subset='test')
>>> len(b.data)
25000