RandomLinkSplit and LinkNeighborLoader for a series of Heterogeneous graphs in a Dataset object class #7181

snknitin · 2023-04-15T18:09:53Z

Hello PyG Community,

I have built my own Dataset (a larger InMemoryDataset) in PyG, which is a series of 365 daily graphs. The graphs are HeteroData objects with 3 node types and 3 edge types (both with attributes). I'm trying to build a dynamic link prediction module, but I can't figure out how to apply RandomLinkSplit over the whole dataset; I keep running into an error. Also, indexing a particular graph from the dataset and applying RandomLinkSplit to it to sample negative edges doesn't work, but if I create a blank HeteroData, copy all the features and information from the indexed graph into it, and then apply RandomLinkSplit, it works.

This is how I use it currently for a single graph

# For this, we first split the set of edges into
# training (80%), validation (10%), and testing edges (10%).
# Across the training edges, we use 70% of edges for message passing, and 30% of edges for supervision.
# We further want to generate fixed negative edges for evaluation with a ratio of 2:1.
# Negative edges during training will be generated on-the-fly.


import torch
import torch_geometric.transforms as T
from torch_geometric.data import HeteroData
from torch_geometric.loader import LinkNeighborLoader

data = HeteroData()
# Copy everything from Dataset[0] into data

transform = T.RandomLinkSplit(
    num_val=0.1,
    num_test=0.1,
    disjoint_train_ratio=0.3,
    neg_sampling_ratio=2.0,
    add_negative_train_samples=False,
    # Only this edge type is split; the other two are currently disabled:
    #   ("node2", "to", "node3"), ("node3", "to", "node1")
    edge_types=("node1", "to", "node2"),
    # Matching reverse edge type; disabled counterparts:
    #   ("node3", "rev_to", "node2"), ("node1", "rev_to", "node3")
    rev_edge_types=("node2", "rev_to", "node1"),
)
train_data, val_data, test_data = transform(data)

# Define seed edges:
edge_label_index = train_data["node1", "to", "node2"].edge_label_index
edge_label = train_data["node1", "to", "node2"].edge_label.long()
train_loader = LinkNeighborLoader(
    data=train_data,
    num_neighbors=[50, 10],
    neg_sampling_ratio=2.0,
    edge_label_index=(("node1", "to", "node2"), edge_label_index),
    edge_label=edge_label,
    batch_size=128,
    shuffle=True,
)
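
As a rough check on the ratios described in the comments above, the sketch below works through the split sizes for a hypothetical edge type with 1,000 edges. The edge count and the helper name split_sizes are assumptions for illustration only, and the sketch assumes each size is obtained by truncating ratio × count; the library's exact rounding may differ.

```python
def split_sizes(num_edges, num_val=0.1, num_test=0.1,
                disjoint_train_ratio=0.3, neg_sampling_ratio=2.0):
    """Hypothetical helper: approximate edge counts produced by a link split."""
    n_val = int(num_val * num_edges)            # positive validation edges
    n_test = int(num_test * num_edges)          # positive test edges
    n_train = num_edges - n_val - n_test        # remaining training edges
    n_supervision = int(disjoint_train_ratio * n_train)  # training labels
    n_message_passing = n_train - n_supervision          # used for propagation
    return {
        "train_message_passing": n_message_passing,
        "train_supervision": n_supervision,
        "val_positive": n_val,
        "val_negative": int(neg_sampling_ratio * n_val),   # fixed negatives
        "test_positive": n_test,
        "test_negative": int(neg_sampling_ratio * n_test),  # fixed negatives
    }


sizes = split_sizes(1000)  # 1000 is an assumed edge count, not from the post
for name, count in sizes.items():
    print(f"{name}: {count}")
```

With add_negative_train_samples=False, no negative training edges appear in this tally because they are sampled on the fly by the loader.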

However, my RandomLinkSplit/LinkNeighborLoader setup only builds a train/test split for a single graph. Ideally, I want the train and test data to contain positive and negative edges from all 365 days, making them larger.

Along with that, the random neighbor sampling in LinkNeighborLoader also gives different dimensions (number of edges in the graph) for the other 2 edge types, which is problematic since I intend to map the edge attributes and concatenate them with the new node features from the encoder (SAGEConv) in the model for binary classification. I'm only predicting links for one of the edge types; the other 2 are auxiliary information that needs to be factored into the decision.

Am I using these methods wrong? I was wondering if there is a clever and easy way to do this with the helper functions and methods already available that I'm not aware of or haven't thought of trying.

In conclusion:

1. How do I apply RandomLinkSplit to all 365 graphs at once and then split into a larger train/test split that has edges from each timeframe?
2. How do I sample or load mini-batches such that the other edge types have exactly the same number of edges, corresponding to the same event (in this case a transaction), so that the edge attributes can be a factor in deciding the validity/probability of a link?

Any advice or guidance would be helpful!

Answered by rusty1s

rusty1s · 2023-04-19T14:50:20Z

Maintainer

If you are working with multiple graphs, the best choice for data loading should be the default DataLoader of PyG. In your case, this is a bit problematic because the transform returns a tuple of data objects. As such, you need to convert that into three datasets, which you can do via

train_dataset, val_dataset, test_dataset = zip(*dataset)

I added a short test for this, see #7211
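
To make the zip(*dataset) idiom concrete, here is a minimal pure-Python sketch of what it does to a dataset whose items are (train, val, test) tuples. ToyDataset and Split are hypothetical stand-ins (not PyG classes) for a dataset with a RandomLinkSplit transform and for its per-graph splits; only the transposition behaviour is being illustrated.

```python
from typing import NamedTuple


class Split(NamedTuple):
    """Hypothetical stand-in for one split of one daily graph."""
    day: int
    part: str


class ToyDataset:
    """Hypothetical stand-in for a dataset whose transform returns a 3-tuple."""

    def __init__(self, num_days):
        self.num_days = num_days

    def __len__(self):
        return self.num_days

    def __getitem__(self, idx):
        if not 0 <= idx < self.num_days:
            raise IndexError(idx)  # lets plain iteration terminate
        # Mimics a transform that yields (train, val, test) per graph.
        return (Split(idx, "train"), Split(idx, "val"), Split(idx, "test"))


dataset = ToyDataset(num_days=365)

# Iterating the dataset yields 365 three-tuples; zip(*...) transposes them
# into three tuples of 365 items each.
train_dataset, val_dataset, test_dataset = zip(*dataset)

print(len(train_dataset), train_dataset[0], val_dataset[0], test_dataset[0])
```

Each result is a plain tuple with one entry per graph, so it may need to be converted to a list before being handed to a data loader.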

4 replies

snknitin Apr 20, 2023
Author

Thank you so much. That is the hack I was thinking of. I wanted to know if there was already a provision to use RandomLinkSplit directly, because in one of the discussion posts I saw you suggest a num_graphs parameter in the method.

rusty1s Apr 21, 2023
Maintainer

Can you point me to the discussion you are referring to? Not totally sure I understand what you mean, sorry.

snknitin Apr 25, 2023
Author

It was in an example code snippet here, but I think the number of graphs was included in the HeteroData object. For some reason, passing the dataset object gives an error, so I did it for a single graph first.

wgeul Oct 9, 2024

> If you are working with multiple graphs, the best choice for dataloading should be the default DataLoader of PyG. In your case, this is a bit problematic because the transform returns a tuple of data objects. As such, you need to convert that into three dataset, which you can do via
> train_dataset, val_dataset, test_dataset = zip(*dataset)
> I added a short test for this, see #7211

Hi @rusty1s,
I'm trying to do the same thing.

def test_random_link_split_on_dataset(get_dataset):
    dataset = get_dataset(name='MUTAG')

    dataset.transform = RandomLinkSplit(
        num_val=0.1,
        num_test=0.1,
        disjoint_train_ratio=0.3,
        add_negative_train_samples=False,
    )

    train_dataset, val_dataset, test_dataset = zip(*dataset)

This yields train_dataset as a tuple of n objects, one for each graph's train split (and likewise for val/test).

Q: Can you advise on how to proceed with the LinkNeighborLoader implementation, given that train_data is now a tuple of Data or HeteroData objects?

edge_label_index = train_data[target_edge].edge_label_index
edge_label = train_data[target_edge].edge_label

train_loader = LinkNeighborLoader(
    data=train_data,
    num_neighbors=[20, 10],
    neg_sampling_ratio=2.0,
    edge_label_index=(target_edge, edge_label_index),
    edge_label=edge_label,
    batch_size=128,
    shuffle=True,
)

I suspect torch_geometric.data.Batch should be used, is this correct? e.g.:

from torch_geometric.data import Batch

train_data, val_data, test_data = (
    Batch.from_data_list(train_data),
    Batch.from_data_list(val_data),
    Batch.from_data_list(test_data),
)
