AutoGL Dataset
AutoGL imports the dataset modules from CogDL and PyTorch Geometric and adds support for datasets from OGB. For details on creating and building datasets, refer to the tutorials of CogDL, PyTorch Geometric, and OGB.
Supported datasets
AutoGL now supports the following benchmarks for different tasks:
Semi-supervised node classification: Cora, Citeseer, Pubmed, Amazon Computers*, Amazon Photo*, Coauthor CS*, Coauthor Physics*, Reddit (*: splitting the dataset with utils.random_splits_mask_class is recommended). For details on the supported datasets, please refer to the PyTorch Geometric Dataset documentation.
| Dataset | PyG | CogDL | x | y | edge_index | edge_attr | train/val/test node | train/val/test mask |
|---|---|---|---|---|---|---|---|---|
| Cora | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| Citeseer | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| Pubmed | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | |
| Amazon Computers | ✓ | ✓ | ✓ | ✓ | ✓ | | | |
| Amazon Photo | ✓ | ✓ | ✓ | ✓ | ✓ | | | |
| Coauthor CS | ✓ | ✓ | ✓ | ✓ | ✓ | | | |
| Coauthor Physics | ✓ | ✓ | ✓ | ✓ | ✓ | | | |
| Reddit | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | | |
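The random split recommended for the datasets marked with * can be illustrated in plain Python. This is only a sketch of the idea behind a class-balanced random split: the helper name `random_split_masks_per_class` and its parameters are hypothetical, and it is not the implementation or signature of AutoGL's utils.random_splits_mask_class, which should be checked in the AutoGL source.

```python
import random
from collections import defaultdict


def random_split_masks_per_class(labels, num_train_per_class, num_val, num_test, seed=0):
    """Return boolean train/val/test masks over the nodes.

    Hypothetical helper for illustration only; not AutoGL's implementation.
    """
    rng = random.Random(seed)
    num_nodes = len(labels)

    # Group node indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Sample the same number of training nodes from every class.
    train_idx = []
    for label in sorted(by_class):
        nodes = by_class[label]
        if len(nodes) < num_train_per_class:
            raise ValueError(f"class {label!r} has only {len(nodes)} nodes")
        train_idx.extend(rng.sample(nodes, num_train_per_class))

    # Shuffle the remaining nodes and carve out validation and test sets.
    train_set = set(train_idx)
    rest = [i for i in range(num_nodes) if i not in train_set]
    rng.shuffle(rest)
    if len(rest) < num_val + num_test:
        raise ValueError("not enough nodes left for validation and test")
    val_idx = rest[:num_val]
    test_idx = rest[num_val:num_val + num_test]

    def to_mask(indices):
        mask = [False] * num_nodes
        for i in indices:
            mask[i] = True
        return mask

    return to_mask(train_idx), to_mask(val_idx), to_mask(test_idx)


# Toy labels: 12 nodes in 3 classes.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
train_mask, val_mask, test_mask = random_split_masks_per_class(
    labels, num_train_per_class=2, num_val=3, num_test=3, seed=42
)
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing models on datasets that ship without a standard split.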
Graph classification: MUTAG, IMDB-B, IMDB-M, PROTEINS, COLLAB
| Dataset | PyG | CogDL | x | y | edge_index | edge_attr |
|---|---|---|---|---|---|---|
| MUTAG | ✓ | ✓ | ✓ | ✓ | ✓ | |
| IMDB-B | ✓ | ✓ | ✓ | | | |
| IMDB-M | ✓ | ✓ | ✓ | | | |
| PROTEINS | ✓ | ✓ | ✓ | ✓ | | |
| COLLAB | ✓ | ✓ | ✓ | | | |
TODO: Support all datasets from PyTorch Geometric.
OGB datasets
AutoGL also supports the popular OGB benchmarks for node classification and graph classification tasks. For a summary of the OGB datasets, please refer to their docs.
Because the loss and evaluation metric used for OGB datasets vary across tasks, each dataset also carries string properties that identify them:
| Dataset | dataset.metric | dataset.loss |
|---|---|---|
| ogbn-products | Accuracy | nll_loss |
| ogbn-proteins | ROC-AUC | BCEWithLogitsLoss |
| ogbn-arxiv | Accuracy | nll_loss |
| ogbn-papers100M | Accuracy | nll_loss |
| ogbn-mag | Accuracy | nll_loss |
| ogbg-molhiv | ROC-AUC | BCEWithLogitsLoss |
| ogbg-molpcba | AP | BCEWithLogitsLoss |
| ogbg-ppa | Accuracy | CrossEntropyLoss |
| ogbg-code | F1 score | CrossEntropyLoss |
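One way such string properties can be used is to dispatch to the matching evaluation function. The sketch below assumes the properties are exposed as `dataset.metric` and `dataset.loss`; the `METRICS` registry, the `evaluate` helper, and the stand-in dataset object are hypothetical and only illustrate the lookup, with "Accuracy" as the single implemented metric.

```python
from types import SimpleNamespace


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical registry keyed by the metric strings listed in the table.
# Only "Accuracy" is implemented in this sketch.
METRICS = {"Accuracy": accuracy}


def evaluate(dataset, y_true, y_pred):
    """Pick the metric function from the dataset's string property."""
    try:
        metric_fn = METRICS[dataset.metric]
    except KeyError:
        raise NotImplementedError(
            f"no metric registered for {dataset.metric!r}"
        ) from None
    return metric_fn(y_true, y_pred)


# Stand-in for a loaded dataset object carrying the two string properties.
fake_arxiv = SimpleNamespace(name="ogbn-arxiv", metric="Accuracy", loss="nll_loss")
score = evaluate(fake_arxiv, y_true=[0, 1, 2, 1], y_pred=[0, 1, 1, 1])
print(score)
```

The same lookup pattern would apply to the loss string, mapping names such as nll_loss or BCEWithLogitsLoss to the corresponding loss functions of the training framework.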
Create a dataset via URL
If your dataset has the same format as the ‘ppi’ dataset, which contains two matrices, ‘network’ and ‘group’, you can register it directly with code like the following. By default, datasets are downloaded to ~/.cache-autogl; you can specify a different root by passing a string as the path argument of build_dataset(args, path) or build_dataset_from_name(dataset, path).
# following code-snippet is from autogl/datasets/matlab_matrix.py
@register_dataset("ppi")
class PPIDataset(MatlabMatrix):
    def __init__(self, path):
        dataset, filename = "ppi", "Homo_sapiens"
        url = "http://snap.stanford.edu/node2vec/"
        super(PPIDataset, self).__init__(path, filename, url)
You should declare the name of the dataset, the name of the file, and the URL from which our script can download the resource. Then you can use either build_dataset(args, path) or build_dataset_from_name(dataset, path) in your task to build the dataset with the corresponding parameters.
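The register-then-build pattern described here can be illustrated with a small self-contained sketch. The registry below is a simplified stand-in written for this example; it reuses the names register_dataset and build_dataset_from_name for readability, but it is not AutoGL's implementation, and `ToyDataset`, `DATASET_REGISTRY`, and `DEFAULT_ROOT` are hypothetical names.

```python
import os

# Hypothetical registry mapping dataset names to classes (illustration only).
DATASET_REGISTRY = {}

# Default download root mentioned in the text.
DEFAULT_ROOT = os.path.expanduser("~/.cache-autogl")


def register_dataset(name):
    """Class decorator that records a dataset class under `name`."""
    def decorator(cls):
        if name in DATASET_REGISTRY:
            raise ValueError(f"dataset {name!r} is already registered")
        DATASET_REGISTRY[name] = cls
        return cls
    return decorator


def build_dataset_from_name(name, path=DEFAULT_ROOT):
    """Look up the registered class and instantiate it with the given root."""
    if name not in DATASET_REGISTRY:
        raise KeyError(f"unknown dataset {name!r}")
    return DATASET_REGISTRY[name](path)


@register_dataset("toy")
class ToyDataset:
    def __init__(self, path):
        # A real dataset class would download and load its files under `path`.
        self.path = path


dataset = build_dataset_from_name("toy", path="/tmp/my-datasets")
print(type(dataset).__name__, dataset.path)
```

Registering by name keeps task code independent of concrete dataset classes: the task only needs the string name and, optionally, a root directory.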
Create a dataset locally
If you want to test your local dataset, we recommend referring to the PyTorch Geometric docs on creating your own datasets.
You can inherit from torch_geometric.data.InMemoryDataset to create an empty dataset, create torch_geometric.data.Data objects for your data, collect them in a regular Python list, and then pass that list to your torch_geometric.data.Dataset subclass or to torch_geometric.data.DataLoader. A simplified example of this process:
import torch
from torch_geometric.data import Data, InMemoryDataset


class MyDataset(InMemoryDataset):
    def __init__(self, datalist) -> None:
        super().__init__()
        # Merge the list of Data objects into one storage object plus slices
        self.data, self.slices = self.collate(datalist)


# Create your own Data objects.
# For example, if you have edge_index, features and labels,
# you can create a Data object as follows.
# The toy tensors below are placeholders; replace them with your own data.
# See the PyTorch Geometric docs for more information on Data.
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]], dtype=torch.long)
features = torch.randn(3, 16)
labels = torch.tensor([0, 1, 0])

data = Data()
data.edge_index = edge_index
data.x = features
data.y = labels

# Create a list of Data objects (append more Data objects as needed)
data_list = [data]

# Initialize the AutoGL dataset with your own data
myData = MyDataset(data_list)
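The call to self.collate(datalist) in MyDataset merges the individual Data objects into one storage object and records where each example starts and ends. The plain-Python sketch below illustrates that idea for node features only; `collate_node_features` and `get_graph` are hypothetical helpers, and PyTorch Geometric's actual collate also handles edge indices and other attributes.

```python
def collate_node_features(graphs):
    """Concatenate per-graph feature rows and record slice boundaries.

    `graphs` is a list of graphs, each given as a list of node feature rows.
    Hypothetical helper; a simplified picture of what collation does.
    """
    merged = []
    slices = [0]
    for rows in graphs:
        merged.extend(rows)
        slices.append(len(merged))  # end offset of this graph's rows
    return merged, slices


def get_graph(merged, slices, idx):
    """Recover the rows of the idx-th graph from the merged storage."""
    return merged[slices[idx]:slices[idx + 1]]


graphs = [
    [[1.0, 0.0], [0.0, 1.0]],              # graph 0: 2 nodes
    [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]],  # graph 1: 3 nodes
]
merged, slices = collate_node_features(graphs)
second = get_graph(merged, slices, 1)
```

Storing one merged object with offsets avoids keeping many small objects in memory while still allowing each example to be sliced out on demand.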