fetch_sentiment140

mlquantify.datasets.fetch_sentiment140(*, data_home=None, download_if_missing=True, return_X_y=False, as_frame=False, n_retries=3, delay=1.0, protocol=None, n_samples=1000, sample_size=500, random_state=None)

Sentiment140: 1.6M timestamped tweets (text, binary, temporal).

1.6 million tweets weakly labelled by emoticons as negative (0) or positive (1), kept in their original chronological order (the date field is exposed in the Bunch). The data is returned as raw tweet text with 0/1 labels, which makes it well suited to estimating a sentiment-trend curve over time.

Quantification: a real timeline for quantification-over-time of sentiment.
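
As a rough illustration of quantification over time, the sketch below computes the positive-class prevalence per day from timestamps and 0/1 labels. It runs on a small synthetic sample rather than the real download, and the column names `date` and `target` are assumptions made for the example, not names guaranteed by the Bunch.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: hourly timestamps plus 0/1 labels.
# The column names "date" and "target" are illustrative assumptions.
rng = np.random.default_rng(0)
n_hours = 6 * 24  # six full days
dates = pd.Timestamp("2009-04-06") + pd.to_timedelta(np.arange(n_hours), unit="h")
labels = rng.integers(0, 2, size=n_hours)
df = pd.DataFrame({"date": dates, "target": labels})

# The mean of a 0/1 label is the fraction of positives, so a daily
# resample of the label column gives one prevalence value per day.
trend = df.set_index("date")["target"].resample("D").mean()
print(trend)
```

Each value in `trend` is the share of positive tweets in that day; a quantifier evaluated on this dataset would be asked to estimate that curve without seeing the labels.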

Documents	1600000
Features	raw text (+ date)
Classes	2 (balanced; raw labels 0/4, returned as 0/1)
Order	chronological

Source: https://www.kaggle.com/datasets/kazanova/sentiment140 (Stanford mirror)

Parameters:

data_home : str or path-like, default=None
    Folder used to cache the downloaded file(s); defaults to _data/ next to the package.
download_if_missing : bool, default=True
    If False, raise instead of downloading when the cache is empty.
return_X_y : bool, default=False
    Return (X, y) instead of a Bunch.
as_frame : bool, default=False
    Return .data as a DataFrame, .target as a Series, and a combined .frame (features + a "target" column).
n_retries : int, default=3
    Number of download attempts before giving up.
delay : float, default=1.0
    Seconds to wait between attempts.
protocol : {None, "app", "npp", "upp", "ppp"} or mlquantify protocol, default=None
    If set, draw evaluation sample-bags with an mlquantify protocol; the Bunch then also has .samples (index bags into .data), .prevalences and .protocol.
n_samples : int, default=1000
    Number of prevalence points (bags) generated by the protocol.
sample_size : int, default=500
    Instances per bag (the protocol batch_size).
random_state : int or None, default=None
    Seed forwarded to the protocol.
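
To make the roles of `n_samples`, `sample_size` and `random_state` more concrete, here is a minimal NumPy sketch of drawing index bags at controlled positive prevalences, in the spirit of an artificial-prevalence protocol. It is an illustration only and not mlquantify's implementation: `draw_bags` is a hypothetical helper, and the evenly spaced prevalence grid is an assumption.

```python
import numpy as np

def draw_bags(y, n_samples=5, sample_size=10, random_state=None):
    """Hypothetical helper: draw index bags with controlled positive prevalence."""
    rng = np.random.default_rng(random_state)
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Assumed grid: target prevalences evenly spaced in [0, 1].
    targets = np.linspace(0.0, 1.0, n_samples)
    bags, prevalences = [], []
    for p in targets:
        n_pos = int(round(p * sample_size))
        n_neg = sample_size - n_pos
        # Sample with replacement so neither class can run out.
        bag = np.concatenate([
            rng.choice(pos_idx, size=n_pos, replace=True),
            rng.choice(neg_idx, size=n_neg, replace=True),
        ])
        rng.shuffle(bag)
        bags.append(bag)
        # Record the prevalence actually realised in the bag.
        prevalences.append(y[bag].mean())
    return bags, np.array(prevalences)

y = np.array([0, 1] * 50)  # toy 0/1 labels, balanced
bags, prevalences = draw_bags(y, n_samples=5, sample_size=10, random_state=0)
print(prevalences)
```

Each bag is an array of row indices into the data, paired with the prevalence realised in that bag, which mirrors the index-bag and prevalence pairing described for `.samples` and `.prevalences`.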

Returns:

data : Bunch
    Dictionary-like object. Attributes: data (features), target (labels), feature_names, target_names, DESCR; frame when as_frame=True; and samples / prevalences / protocol when protocol is set.
(X, y) : tuple
    Returned instead when return_X_y=True.

References

Go, A., Bhayani, R. & Huang, L. (2009). Twitter sentiment classification using distant supervision. CS224N project report, Stanford.

Examples

>>> from mlquantify.datasets import fetch_sentiment140
>>> b = fetch_sentiment140()
>>> len(b.data)
1600000