fetch_sentiment140

mlquantify.datasets.fetch_sentiment140(*, data_home=None, download_if_missing=True, return_X_y=False, as_frame=False, n_retries=3, delay=1.0, protocol=None, n_samples=1000, sample_size=500, random_state=None)

Sentiment140: 1.6M timestamped tweets (text, binary, temporal).

1.6 million tweets weakly labelled by emoticons as negative (0) or positive (1), in original chronological order (the date field is exposed in the Bunch). Returned as raw tweet text + 0/1 labels; ideal for estimating the sentiment-trend curve over time.

Quantification: a real timeline for quantification-over-time of sentiment.
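As a rough illustration of quantification over time, the sketch below computes the positive-class prevalence in consecutive windows of a chronologically ordered 0/1 label sequence. The helper name windowed_prevalence and the toy labels are hypothetical and are not part of mlquantify; a real analysis would use the labels returned by the loader.

```python
def windowed_prevalence(labels, window):
    """Positive-class prevalence in consecutive, non-overlapping windows.

    `labels` is a chronologically ordered sequence of 0/1 values.
    The last window may be shorter than `window`.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    curve = []
    for start in range(0, len(labels), window):
        chunk = labels[start:start + window]
        curve.append(sum(chunk) / len(chunk))
    return curve


# Toy labels standing in for the loader's target (hypothetical data).
labels = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1]
curve = windowed_prevalence(labels, window=4)
print(curve)  # [0.25, 0.75, 1.0]
```

Each value in the resulting list is the ground-truth prevalence a quantifier would be asked to estimate for that time window.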

Documents    1600000
Features     raw text (+ date)
Classes      2 (balanced; raw labels 0/4, returned as 0/1)
Order        chronological

Source: https://www.kaggle.com/datasets/kazanova/sentiment140 (Stanford mirror)
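For orientation, a minimal sketch of how the raw polarity codes could be remapped to the 0/1 labels described above. It assumes the commonly distributed CSV layout (no header row; columns polarity, id, date, query, user, text; polarity 0 = negative, 4 = positive). The two rows are made up, and parse_rows is a hypothetical helper, not the loader's actual parsing code.

```python
import csv
import io

# Two made-up rows in the assumed raw layout:
# polarity, id, date, query, user, text
RAW = (
    '"0","1","Mon Apr 06 22:19:45 PDT 2009","NO_QUERY","user_a","missed the bus again"\n'
    '"4","2","Mon Apr 06 22:20:01 PDT 2009","NO_QUERY","user_b","great coffee this morning"\n'
)


def parse_rows(handle):
    """Return (texts, labels, dates), remapping polarity 0/4 to 0/1."""
    texts, labels, dates = [], [], []
    for polarity, _id, date, _query, _user, text in csv.reader(handle):
        labels.append(1 if polarity == "4" else 0)
        texts.append(text)
        dates.append(date)
    return texts, labels, dates


texts, labels, dates = parse_rows(io.StringIO(RAW))
print(labels)  # [0, 1]
```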

Parameters:
data_home : str or path-like, default=None

Folder used to cache the downloaded file(s); defaults to _data/ next to the package.

download_if_missing : bool, default=True

If False, raise instead of downloading when the cache is empty.

return_X_y : bool, default=False

Return (X, y) instead of a Bunch.

as_frame : bool, default=False

Return .data as a DataFrame, .target as a Series, and a combined .frame (features + a "target" column).

n_retries : int, default=3

Number of download attempts before giving up.

delay : float, default=1.0

Seconds to wait between attempts.

protocol : {None, "app", "npp", "upp", "ppp"} or mlquantify protocol, default=None

If set, draw evaluation sample-bags with an mlquantify protocol; the Bunch then also has .samples (index bags into .data), .prevalences and .protocol.

n_samples : int, default=1000

Number of prevalence points (bags) generated by the protocol.

sample_size : int, default=500

Instances per bag (the protocol batch_size).

random_state : int or None, default=None

Seed forwarded to the protocol.
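To make the roles of n_samples, sample_size and random_state concrete, here is a minimal sketch of drawing index bags and computing their positive-class prevalence. It assumes plain uniform sampling without replacement, which is a generic illustration and not the behaviour of any specific mlquantify protocol; draw_bags and bag_prevalences are hypothetical helpers, and the exact shape of .samples and .prevalences in the Bunch may differ.

```python
import random


def draw_bags(n_items, n_samples, sample_size, random_state=None):
    """Draw `n_samples` bags of `sample_size` distinct indices each."""
    rng = random.Random(random_state)
    return [rng.sample(range(n_items), sample_size) for _ in range(n_samples)]


def bag_prevalences(labels, bags):
    """Fraction of positive (1) labels inside each index bag."""
    return [sum(labels[i] for i in bag) / len(bag) for bag in bags]


# Toy balanced labels standing in for the loader's target (hypothetical data).
labels = [0, 1] * 50
bags = draw_bags(len(labels), n_samples=5, sample_size=10, random_state=0)
prevs = bag_prevalences(labels, bags)
```

Fixing the seed makes the bags reproducible, so an evaluation over the same bags can be repeated.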

Returns:
data : Bunch

Dictionary-like object. Attributes: data (features), target (labels), feature_names, target_names, DESCR; frame when as_frame=True; and samples / prevalences / protocol when protocol is set.

(X, y) : tuple

Returned instead when return_X_y=True.

References

Go, A., Bhayani, R. & Huang, L. (2009). Twitter sentiment classification using distant supervision. CS224N project report, Stanford University.

Examples

>>> from mlquantify.datasets import fetch_sentiment140
>>> b = fetch_sentiment140()
>>> len(b.data)
1600000