Understanding LinkNeighborLoader on collated data #6519

GianlucaDeStefano · 2023-01-25T22:33:51Z

Hi all,
I am trying to train a model to perform link prediction, and my setup is as follows:

I have a set of graphs I want to use to train my model
I have multiple types of edges and nodes, and I am trying to predict a single type of edge between two specific types of nodes.

To do this, I am doing the following:

I split my graphs into train, validation, test
I collate all the graphs in the same split together, generating 3 huge distinct graphs for training, validation, and test, respectively
From these 3 graphs, I use the LinkNeighborLoader to load examples and train a network.

My question is this:
Since I am using LinkNeighborLoader, I expect batches containing the same number of examples (edges to predict) to have a similar size. However, I have noticed that, on average, a larger collated graph yields batches with more nodes. Why does this happen? Do I misunderstand something here?

In my view, this should not happen since the LinkNeighborLoader class only loads the neighbors of the edge in question, and since the collate function should create a Data object containing multiple distinct graphs, the same edge sampled from the huge collated graph or the original should yield the same neighborhood.

I have created an example to illustrate what I mean:

import torch
from torch_geometric.data import HeteroData, InMemoryDataset
from torch_geometric.loader import LinkNeighborLoader


def get_heterodata():
    h = HeteroData()

    h['node_1'].x = torch.arange(1000)
    h['node_2'].x = torch.arange(1000)
    h['node_3'].x = torch.arange(1000)

    h['node_1', 'edge_1', 'node_2'].edge_index = torch.stack(
        (torch.randint(0, 1000, (300,)), torch.randint(0, 1000, (300,))), dim=0)

    h['node_1', 'edge_2', 'node_3'].edge_index = torch.stack(
        (torch.randint(0, 1000, (300,)), torch.randint(0, 1000, (300,))), dim=0)
    h['node_2', 'edge_3', 'node_3'].edge_index = torch.stack(
        (torch.randint(0, 1000, (300,)), torch.randint(0, 1000, (300,))), dim=0)

    h['node_2', 'target_edge', 'node_3'].edge_index = torch.stack(
        (torch.randint(0, 1000, (10,)), torch.randint(0, 1000, (10,))), dim=0)

    return h


def prepare_loader(data):

    # Explicitly create tensors of missing edges and labels
    target_edge_indexes = data['node_2', 'target_edge', 'node_3'].edge_index
    target_edges_label = torch.ones(len(target_edge_indexes[0]))

    loader = LinkNeighborLoader(
        data=data,
        num_neighbors=[-1] * 5,
        neg_sampling_ratio=0,
        edge_label_index=(('node_2', 'target_edge', 'node_3'), target_edge_indexes),
        edge_label=target_edges_label,
        batch_size=1,
        shuffle=True,
        drop_last=True
    )

    return loader


train_data_list = [get_heterodata() for i in range(20)]
test_data_list = [get_heterodata() for i in range(2)]

training_graph, _ = InMemoryDataset.collate(train_data_list)
test_graph, _ = InMemoryDataset.collate(test_data_list)

print("Training graph:")
print(training_graph)

print("Test graph:")
print(test_graph)

training_dataloader = prepare_loader(training_graph)
test_dataloader = prepare_loader(test_graph)

print("Training batch:")
print(next(iter(training_dataloader)))

print("Test batch:")
print(next(iter(test_dataloader)))

This code first creates two groups of 20 and 2 graphs, respectively, then collates each group to form the following two graphs:

Training graph:
HeteroData(
  node_1={ x=[20000] },
  node_2={ x=[20000] },
  node_3={ x=[20000] },
  (node_1, edge_1, node_2)={ edge_index=[2, 6000] },
  (node_1, edge_2, node_3)={ edge_index=[2, 6000] },
  (node_2, edge_3, node_3)={ edge_index=[2, 6000] },
  (node_2, target_edge, node_3)={ edge_index=[2, 200] }
)
Test graph:
HeteroData(
  node_1={ x=[2000] },
  node_2={ x=[2000] },
  node_3={ x=[2000] },
  (node_1, edge_1, node_2)={ edge_index=[2, 600] },
  (node_1, edge_2, node_3)={ edge_index=[2, 600] },
  (node_2, edge_3, node_3)={ edge_index=[2, 600] },
  (node_2, target_edge, node_3)={ edge_index=[2, 20] }
)

Using LinkNeighborLoader on these two graphs, I then sample two batches like these:

Training batch:
HeteroData(
  node_1={ x=[67] },
  node_2={ x=[11] },
  node_3={ x=[1] },
  (node_1, edge_1, node_2)={ edge_index=[2, 60] },
  (node_1, edge_2, node_3)={ edge_index=[2, 8] },
  (node_2, edge_3, node_3)={ edge_index=[2, 10] },
  (node_2, target_edge, node_3)={
    edge_index=[2, 1],
    input_id=[1],
    edge_label_index=[2, 1],
    edge_label=[1]
  }
)
Test batch:
HeteroData(
  node_1={ x=[4] },
  node_2={ x=[2] },
  node_3={ x=[1] },
  (node_1, edge_1, node_2)={ edge_index=[2, 3] },
  (node_1, edge_2, node_3)={ edge_index=[2, 1] },
  (node_2, edge_3, node_3)={ edge_index=[2, 0] },
  (node_2, target_edge, node_3)={
    edge_index=[2, 2],
    input_id=[1],
    edge_label_index=[2, 1],
    edge_label=[1]
  }
)

Why is the graph contained in the training batch always noticeably larger than the one in the test batch, even with a batch size of 1?
Of course, the collated training graph is larger, but, as I said above, in my view this should not matter.

Do I misunderstand something here?

Thanks in advance

rusty1s · 2023-01-26T11:21:10Z

Maintainer

Since you are using num_neighbors=[-1] ..., the full neighborhood will be returned. For validation and test, the number of message passing edges is also slightly bigger, since, e.g., the test set also holds validation edges for message passing.

5 replies

GianlucaDeStefano Jan 26, 2023
Author

Thank you for your answer.

Yes, but in the example above, there are no train/validation/test splits; the two graphs are composed of 20 and 2 subgraphs, respectively, and each one of these has the same number of nodes and edges.

Sampling the neighborhood of one edge from the training graph and one from the test graph should, on average, yield two subgraphs with a similar number of nodes and edges.
However, this does not happen.
As shown above, the training batch consistently contains many more nodes than the test batch.

I suspect I may be misunderstanding what the collate function does in practice. In my mind, the collate function merges multiple individual graphs into a single Data object while keeping them isolated. Therefore it should not modify the characteristics of the neighborhood of an edge.
Am I missing something?

rusty1s Jan 26, 2023
Maintainer

Yeah, sorry. I missed that you are not using RandomLinkSplit. Anyway, you are right: collate does not do what you expect :) It will not create disjoint graphs; it just concatenates tensors together. Instead, you should use

train_data = Batch.from_data_list(data_list)

GianlucaDeStefano Jan 26, 2023
Author

Thank you very much!

GianlucaDeStefano Jan 28, 2023
Author

Just to clarify, with 'collate function', I meant the 'self.collate' function used in the InMemoryDataset example here.

If this function just concatenates tensors, isn't the example incorrect?

rusty1s Jan 28, 2023
Maintainer

I don't think the example is incorrect. Can you clarify? InMemoryDataset just collates the data of every Data object together, which helps with data loading speed from disk and is beneficial in shared memory examples. It does not guarantee disjoint subgraphs.
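
To make the distinction concrete, here is a plain-Python sketch of "concatenate plus slices" storage versus a disjoint union. It is a toy model of the idea, not PyG's actual implementation; the names flat_edges, slices, get_graph, and neighbors are hypothetical.

```python
# Two toy graphs, each an edge list over local node ids 0..2.
graphs = [
    [(0, 1), (1, 2)],
    [(0, 2)],
]
num_nodes_per_graph = 3

# Collate-style storage: concatenate everything and remember the boundaries.
flat_edges = [edge for g in graphs for edge in g]
slices = [0]
for g in graphs:
    slices.append(slices[-1] + len(g))


def get_graph(i):
    # Recover graph i by slicing; its local node ids come back unchanged.
    return flat_edges[slices[i]:slices[i + 1]]


def neighbors(edges, node):
    # All nodes that share an edge with `node`, ignoring direction.
    found = set()
    for src, dst in edges:
        if src == node:
            found.add(dst)
        if dst == node:
            found.add(src)
    return found


# Reading the concatenated storage as ONE graph merges node 0 of every graph.
merged_nbrs = neighbors(flat_edges, 0)

# A disjoint union instead shifts the ids of graph i by i * num_nodes_per_graph.
offset_edges = [
    (src + i * num_nodes_per_graph, dst + i * num_nodes_per_graph)
    for i, g in enumerate(graphs)
    for src, dst in g
]
disjoint_nbrs = neighbors(offset_edges, 0)

print(get_graph(1), merged_nbrs, disjoint_nbrs)
```

The slices are enough to load individual graphs back quickly, which is the use case described above. Sampling neighbors directly on the concatenated storage is a different use, and it is there that the missing index offsets show up as larger neighborhoods.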
