fetch_planetoid_cora_citeseer_pubmed

mlquantify.datasets.fetch_planetoid_cora_citeseer_pubmed(*, data_home=None, download_if_missing=True, return_X_y=False, as_frame=False, n_retries=3, delay=1.0, protocol=None, n_samples=1000, sample_size=500, random_state=None, name='cora')

Planetoid citation graphs: Cora / CiteSeer / PubMed (graph nodes).

Loads one of the three standard Planetoid citation networks, selected with name. Nodes are papers with bag-of-words/TF-IDF features and a topic label; .graph holds the adjacency (citation links). Cora: 2708 nodes, 1433 features, 7 classes. CiteSeer: 3327 nodes, 3703 features, 6 classes. PubMed: 19717 nodes, 500 features, 3 classes.

Quantification: node-level quantification under covariate/structural graph shift.

Summary (Cora / CiteSeer / PubMed):

Nodes: 2708 / 3327 / 19717
Features: 1433 / 3703 / 500 (sparse)
Classes: 7 / 6 / 3

Source: kimiyoung/planetoid

Parameters:
data_home : str or path-like, default=None

Folder used to cache the downloaded file(s); defaults to _data/ next to the package.

download_if_missing : bool, default=True

If False, raise instead of downloading when the cache is empty.

return_X_y : bool, default=False

Return (X, y) instead of a Bunch.

as_frame : bool, default=False

Return .data as a DataFrame, .target as a Series, and a combined .frame (features + a "target" column).

n_retries : int, default=3

Number of download attempts before giving up.

delay : float, default=1.0

Seconds to wait between attempts.

protocol : {None, "app", "npp", "upp", "ppp"} or mlquantify protocol, default=None

If set, draw evaluation sample-bags with an mlquantify protocol; the Bunch then also has .samples (index bags into .data), .prevalences and .protocol.

n_samples : int, default=1000

Number of prevalence points (bags) generated by the protocol.

sample_size : int, default=500

Instances per bag (the protocol batch_size).

random_state : int or None, default=None

Seed forwarded to the protocol.

name : {'cora', 'citeseer', 'pubmed'}, default='cora'

Which citation network to load.
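
To illustrate how n_retries and delay interact, here is a minimal standard-library sketch of a retry loop. It is not mlquantify's implementation: fetch_with_retries, flaky_download and the injectable sleep argument are hypothetical names used only for this illustration, and the choice to retry on OSError is an assumption.

```python
import time


def fetch_with_retries(download, n_retries=3, delay=1.0, sleep=time.sleep):
    """Call download() up to n_retries times, pausing delay seconds between failed attempts."""
    if n_retries < 1:
        raise ValueError("n_retries must be at least 1")
    last_error = None
    for attempt in range(1, n_retries + 1):
        try:
            return download()
        except OSError as exc:  # assumption: network failures surface as OSError
            last_error = exc
            if attempt < n_retries:
                sleep(delay)  # wait before the next attempt, not after the last one
    raise last_error


# Hypothetical download that fails twice and then succeeds.
calls = {"n": 0}


def flaky_download():
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("temporary network error")
    return b"payload"


# sleep is replaced by a no-op so the example runs instantly.
result = fetch_with_retries(flaky_download, n_retries=3, delay=1.0, sleep=lambda s: None)
```

With n_retries=3 the third attempt succeeds here; with n_retries=2 the same call would re-raise the last OSError.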

Returns:
data : Bunch

Dictionary-like object. Attributes: data (features), target (labels), feature_names, target_names, DESCR; frame when as_frame=True; and samples / prevalences / protocol when protocol is set.

(X, y) : tuple

Returned instead when return_X_y=True.
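
The relationship between index bags and class prevalences can be sketched without the library. The snippet below is an illustration under stated assumptions: labels and bags are synthetic stand-ins for .target and .samples, bags are assumed to be sequences of row indices, and the uniform sampling used here is not one of the mlquantify protocols.

```python
import random
from collections import Counter


def bag_prevalences(labels, bags, n_classes):
    """Return one class-prevalence vector (fractions summing to 1) per bag of row indices."""
    prevalences = []
    for bag in bags:
        counts = Counter(labels[i] for i in bag)
        prevalences.append([counts[c] / len(bag) for c in range(n_classes)])
    return prevalences


# Synthetic stand-ins sized like Cora: 2708 nodes, 7 classes, bags of 500 nodes.
rng = random.Random(0)
n_classes = 7
labels = [rng.randrange(n_classes) for _ in range(2708)]
bags = [rng.sample(range(len(labels)), 500) for _ in range(3)]

prevalences = bag_prevalences(labels, bags, n_classes)
```

Each entry of prevalences is the vector a quantifier would be asked to estimate for the corresponding bag.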

References

Yang, Z. et al. (2016). Revisiting semi-supervised learning with graph embeddings. ICML 2016.

Sen, P. et al. (2008). Collective classification in network data. AI Magazine.

Examples

>>> from mlquantify.datasets import fetch_planetoid_cora_citeseer_pubmed
>>> b = fetch_planetoid_cora_citeseer_pubmed(name='cora')
>>> b.data.shape  # sparse feature matrix; b.graph has the edges
(2708, 1433)