AutoGL Dataset

We import the dataset modules from CogDL and PyTorch Geometric and add support for datasets from OGB. For details on creating and building datasets, refer to the tutorials of CogDL, PyTorch Geometric, and OGB.

Supported datasets

AutoGL now supports the following benchmarks for different tasks:

Semi-supervised node classification: Cora, Citeseer, Pubmed, Amazon Computers*, Amazon Photo*, Coauthor CS*, Coauthor Physics*, Reddit (*: splitting the dataset with utils.random_splits_mask_class is recommended). For details on the supported datasets, refer to the PyTorch Geometric Dataset documentation.

Dataset | PyG | CogDL | x | y | edge_index | edge_attr | train/val/test node | train/val/test mask
Cora
Citeseer
Pubmed
Amazon Computers
Amazon Photo
Coauthor CS
Coauthor Physics
Reddit
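
The per-class random split recommended for the starred datasets can be sketched in plain Python as follows. This is only an illustration of the idea, not the implementation of utils.random_splits_mask_class; the function name random_split_masks, its parameters, and the default counts are assumptions made for this example.

```python
import random
from collections import defaultdict


def random_split_masks(labels, num_train_per_class=2, num_val_per_class=1, seed=0):
    """Return boolean train/val/test masks with a fixed number of nodes per class.

    Hypothetical helper for illustration; not part of AutoGL.
    """
    rng = random.Random(seed)

    # Group node indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    n = len(labels)
    train = [False] * n
    val = [False] * n
    test = [False] * n

    for indices in by_class.values():
        rng.shuffle(indices)
        cut = num_train_per_class + num_val_per_class
        for i in indices[:num_train_per_class]:
            train[i] = True
        for i in indices[num_train_per_class:cut]:
            val[i] = True
        # Remaining nodes of this class go to the test set.
        for i in indices[cut:]:
            test[i] = True

    return train, val, test


# Toy labels: four nodes of class 0 and five nodes of class 1.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 1]
train_mask, val_mask, test_mask = random_split_masks(labels)
print(sum(train_mask), sum(val_mask), sum(test_mask))  # 4 2 3
```

Which nodes land in each mask depends on the seed, but the per-class counts stay fixed, which keeps the training set balanced on datasets that ship without a standard split.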

Graph classification: MUTAG, IMDB-B, IMDB-M, PROTEINS, COLLAB

Dataset | PyG | CogDL | x | y | edge_index | edge_attr
MUTAG
IMDB-B
IMDB-M
PROTEINS
COLLAB

TODO: Support all datasets from PyTorch Geometric.

OGB datasets

AutoGL also supports the popular OGB benchmarks for node classification and graph classification tasks. For a summary of the OGB datasets, refer to their docs.

Since the loss and evaluation metric used for OGB datasets vary among different tasks, we also add string properties of datasets for identification:

Dataset | dataset.metric | dataset.loss
ogbn-products | Accuracy | nll_loss
ogbn-proteins | ROC-AUC | BCEWithLogitsLoss
ogbn-arxiv | Accuracy | nll_loss
ogbn-papers100M | Accuracy | nll_loss
ogbn-mag | Accuracy | nll_loss
ogbg-molhiv | ROC-AUC | BCEWithLogitsLoss
ogbg-molpcba | AP | BCEWithLogitsLoss
ogbg-ppa | Accuracy | CrossEntropyLoss
ogbg-code | F1 score | CrossEntropyLoss
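
Because these properties are plain strings, downstream code can branch on them. The sketch below uses a stand-in object instead of a real OGB dataset; the METRICS table, the accuracy helper, and the evaluate function are hypothetical names for this illustration, and only the Accuracy metric is implemented.

```python
from types import SimpleNamespace


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Map the string stored in dataset.metric to a scoring function.
METRICS = {"Accuracy": accuracy}


def evaluate(dataset, y_true, y_pred):
    """Look up the metric named by dataset.metric and apply it."""
    if dataset.metric not in METRICS:
        raise ValueError(f"no scorer registered for metric {dataset.metric!r}")
    return METRICS[dataset.metric](y_true, y_pred)


# Stand-in for a dataset such as ogbn-arxiv, which reports Accuracy / nll_loss.
dataset = SimpleNamespace(metric="Accuracy", loss="nll_loss")
print(evaluate(dataset, [0, 1, 2, 1], [0, 1, 1, 1]))  # 0.75
```

The same lookup pattern could be applied to the loss string, with each name mapped to the matching loss function of the training framework in use.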

Create a dataset via URL

If your dataset has the same format as the ‘ppi’ dataset, which contains two matrices, ‘network’ and ‘group’, you can register it directly with code like the following snippet. The default root for downloading datasets is ~/.cache-autogl; you can specify a different root by passing a string as the path argument of build_dataset(args, path) or build_dataset_from_name(dataset, path).

# The following code snippet is from autogl/datasets/matlab_matrix.py

@register_dataset("ppi")
class PPIDataset(MatlabMatrix):
    def __init__(self, path):
        dataset, filename = "ppi", "Homo_sapiens"
        url = "http://snap.stanford.edu/node2vec/"
        super(PPIDataset, self).__init__(path, filename, url)

You should declare the name of the dataset, the name of the file, and the URL from which the script can download the resource. Then you can use either build_dataset(args, path) or build_dataset_from_name(dataset, path) in your task to build the dataset with the corresponding parameters.
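
The registration mechanism can be pictured as a name-to-class lookup table. The following is a simplified sketch of that pattern, not AutoGL's actual code: the DATASET_REGISTRY dictionary, the ToyDataset class, and the bodies of register_dataset and build_dataset_from_name are assumptions written for this example, and nothing is downloaded.

```python
import os

# Hypothetical registry mapping dataset names to classes.
DATASET_REGISTRY = {}


def register_dataset(name):
    """Class decorator that records a dataset class under the given name."""
    def decorator(cls):
        if name in DATASET_REGISTRY:
            raise ValueError(f"dataset {name!r} is already registered")
        DATASET_REGISTRY[name] = cls
        return cls
    return decorator


def build_dataset_from_name(dataset, path="~/.cache-autogl"):
    """Instantiate the class registered under `dataset`, rooted at `path`."""
    root = os.path.expanduser(path)
    return DATASET_REGISTRY[dataset](root)


@register_dataset("toy")
class ToyDataset:
    def __init__(self, path):
        # A real dataset would download and parse files under this path.
        self.path = path


ds = build_dataset_from_name("toy", path="/tmp/autogl-data")
print(type(ds).__name__, ds.path)  # ToyDataset /tmp/autogl-data
```

Registering by decorator keeps the dataset name next to the class definition, so a new dataset only needs to be declared once to become available by name.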

Create a dataset locally

If you want to test your local dataset, we recommend referring to the docs on creating a PyTorch Geometric dataset.

You can simply inherit from torch_geometric.data.InMemoryDataset to create an empty dataset, create torch_geometric.data.Data objects for your data, and collect them in a regular Python list. The list can then be passed to a torch_geometric.data.Dataset subclass or to torch_geometric.data.DataLoader. Let’s see this process in a simplified example:

from torch_geometric.data import Data, InMemoryDataset

class MyDataset(InMemoryDataset):
    """In-memory dataset built from a list of Data objects."""
    def __init__(self, datalist) -> None:
        super().__init__()
        # collate() merges the list into one storage object plus slice indices
        self.data, self.slices = self.collate(datalist)

# Create your own Data objects.
# For example, if you already have edge_index, features, and labels
# (assumed here to be existing tensors), you can create a Data object as follows.
# See the PyTorch Geometric docs for more information on Data.
data = Data()
data.edge_index = edge_index
data.x = features
data.y = labels

# Create a list of Data objects
data_list = [data, Data(...), ..., Data(...)]

# Initialize the AutoGL dataset with your own data
myData = MyDataset(data_list)
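
The collate call in the constructor is what lets an in-memory dataset keep many graphs in one object: per-graph arrays are concatenated, and a table of slice boundaries records where each graph begins and ends. The following plain-Python sketch illustrates that idea only; it is not PyTorch Geometric's implementation, and the collate and get functions shown here are simplified stand-ins that operate on lists of node features.

```python
def collate(feature_lists):
    """Concatenate per-graph feature lists and record slice boundaries."""
    data, slices = [], [0]
    for features in feature_lists:
        data.extend(features)
        slices.append(len(data))
    return data, slices


def get(data, slices, idx):
    """Recover the features of graph `idx` from the concatenated storage."""
    return data[slices[idx]:slices[idx + 1]]


# Three toy graphs with 2, 3, and 1 nodes (one scalar feature per node).
graphs = [[0.1, 0.2], [1.0, 1.1, 1.2], [9.9]]
data, slices = collate(graphs)
print(slices)                # [0, 2, 5, 6]
print(get(data, slices, 1))  # [1.0, 1.1, 1.2]
```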