-
We just added the `LinkNeighborLoader`.
-
Hi, sorry, I'm extremely confused about the parameters to pass. I've looked into the source and debugged multiple variations, and I keep getting different assertion errors each time. Would you be able to help me frame the loader to my problem? For your convenience, I've updated the discussion above, under the Replicate section, with code for creating a small-scale data object.

```python
kwargs = {'batch_size': 1024, 'num_workers': 6, 'persistent_workers': True}
num_neigh = {key: [30] * 2 for key in data.edge_types}
loader = LinkNeighborLoader(
    data=g1,
    num_neighbors=???,
    edge_label_index=???,
    edge_label=???,
    replace=False,
    directed=True,
    is_sorted=False,
    neighbor_sampler=None,
    **kwargs)
```

How would these be framed if the data object is a graph like g1? (Editing this as a separate comment.) If possible, could you also take a look at the three questions in the Help Required section? I would love to get your feedback on the dataset construction and custom transform.
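To show what I mean, this is roughly how I'd guess the heterogeneous arguments are meant to be framed (a sketch only; the `('user', 'rates', 'movie')` edge type and the `edge_label` attribute are placeholder assumptions that won't match g1 exactly):

```python
from torch_geometric.loader import LinkNeighborLoader

# Placeholder supervision edge type -- would be one of g1.edge_types
edge_type = ('user', 'rates', 'movie')

loader = LinkNeighborLoader(
    data=g1,
    # 30 neighbors per hop for 2 hops, for every edge type in the graph
    num_neighbors={key: [30, 30] for key in g1.edge_types},
    # supervision edges of one type: (edge_type, edge indices of that type)
    edge_label_index=(edge_type, g1[edge_type].edge_index),
    # per-edge targets aligned with edge_label_index (assumed to exist on g1)
    edge_label=g1[edge_type].edge_label,
    replace=False,
    directed=True,
    batch_size=1024,
    num_workers=6,
    persistent_workers=True,
)
```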
-
Hi PyG Community,
I've been wracking my brains on something for too long and would like to request some help.
I recently built a HeteroData object for an edge-label prediction task and was able to train it successfully. Now that I've increased the amount of data and rebuilt the graph, I run into:
```
RuntimeError: CUDA out of memory. Tried to allocate 11.44 GiB (GPU 0; 15.78 GiB total capacity; 6.47 GiB already allocated; 5.47 GiB free; 8.90 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```
I have tried almost every hack and suggestion posted on multiple forums for such an issue, like running evaluation under `with torch.no_grad():` after `model.eval()`, since annotations/decorators over methods might not have worked. The main issue, for many people running different models, seems to be `batch_size`: basically, the amount of data you put on the GPU is what makes it run out of memory. This is the current data size. I'm not sure whether it is big, or whether there are larger datasets that work just fine because of sampling or loading the data in batches. For now, my train loop doesn't have any batching (I followed some examples), so I can't directly vary `batch_size` to test. I am planning to add it, but I'm not sure that alone will solve the CUDA memory issue. Nothing else has changed, and I've tested the same setup on a larger VM with more memory and GPUs.
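For reference, the no_grad hack I tried looks essentially like this (a minimal sketch; the model and graph names are placeholders):

```python
import torch

# evaluate without building the autograd graph, to save GPU memory
model.eval()
with torch.no_grad():
    out = model(data.x_dict, data.edge_index_dict, data.edge_attr_dict)
```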
**Problem**
I haven't been able to sample or use any DataLoader for this HeteroData, since there was nothing that splits on edges; most loaders are node-based. To make a train/test split, I had to use indexing and slice `edge_index`, `edge_attr` and `edge_label` sequentially, since `RandomLinkSplit` wouldn't have worked in this case. I have temporal features on the edge attributes, and the goal was to learn from one time period and generalize to a future time period, so the sequential split works fine. I don't mind some sequential correlation, and I'd like to someday merge this with PyTorch Geometric Temporal to test a few theories.
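Roughly, the sequential split looks like this (a simplified sketch for a single edge type; the edge type name is a placeholder and the edges are assumed to already be in time order):

```python
# slice the first 80% of (time-ordered) edges into train, the rest into test
edge_type = ('user', 'rates', 'movie')   # placeholder edge type
store = data[edge_type]
cut = int(0.8 * store.edge_index.size(1))

train_data, test_data = data.clone(), data.clone()
train_data[edge_type].edge_index = store.edge_index[:, :cut]
train_data[edge_type].edge_attr = store.edge_attr[:cut]
train_data[edge_type].edge_label = store.edge_label[:cut]

test_data[edge_type].edge_index = store.edge_index[:, cut:]
test_data[edge_type].edge_attr = store.edge_attr[cut:]
test_data[edge_type].edge_label = store.edge_label[cut:]
```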
**Replicate**
If you want to replicate a smaller-scale graph:
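Something along these lines (a stand-in sketch with random features, not the exact snippet; node and edge type names are made up):

```python
import torch
from torch_geometric.data import HeteroData

g1 = HeteroData()
g1['user'].x = torch.randn(100, 16)     # 100 user nodes, 16 features
g1['movie'].x = torch.randn(50, 32)     # 50 movie nodes, 32 features

src = torch.randint(0, 100, (500,))     # 500 random user -> movie edges
dst = torch.randint(0, 50, (500,))
g1['user', 'rates', 'movie'].edge_index = torch.stack([src, dst], dim=0)
g1['user', 'rates', 'movie'].edge_attr = torch.randn(500, 8)
g1['user', 'rates', 'movie'].edge_label = torch.randint(0, 2, (500,)).float()
```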
**Solution?**
(See the Clarification section first to see how the data object was made, and please do correct me if I made any errors or missed something; I feel I did.)
I can see two ways to batch it. The model will still get the full `x_dict` and a smaller (`edge_index_dict`, `edge_attr_dict`).

a) I can maybe do it like this, and only use those indices to access and build a runtime edge_dict, but it seems inefficient.
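Roughly what I have in mind for (a), i.e. sampling edge ids each step and building the smaller dicts at runtime (a sketch; the model call and batch size are placeholders):

```python
import torch

def sample_edge_batch(data, batch_size):
    # pick a random subset of edges per edge type and slice the edge stores
    edge_index_dict, edge_attr_dict = {}, {}
    for edge_type in data.edge_types:
        store = data[edge_type]
        perm = torch.randperm(store.edge_index.size(1))[:batch_size]
        edge_index_dict[edge_type] = store.edge_index[:, perm]
        edge_attr_dict[edge_type] = store.edge_attr[perm]
    return edge_index_dict, edge_attr_dict

# the model still sees the full x_dict; only the edge dicts shrink
edge_index_dict, edge_attr_dict = sample_edge_batch(data, batch_size=1024)
out = model(data.x_dict, edge_index_dict, edge_attr_dict)
```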
b) Every time I try using a loader from the set of available and relevant options, or just the base one, I get an error like this, which makes me think there is something wrong with my graph creation class or some method I'm missing. Please let me know if that is the case. I have attached a template of how I made the Dataset, following all the tutorials and documentation.

This feels like the saner option, but it will require code changes and a rerun (not a big deal), and it will also help me tag train and test graphs easily. Each graph will be individually small enough to run without an issue.
**Help Required**
I would like to request some help in determining:
**Clarification**
Let me know if I'm missing anything. I am going by a list of CSV files, and instead of making a Data list object with multiple graphs, I've combined everything to make one big graph; think one yearly graph instead of 365 daily graphs. The reason behind this was that I wanted to standardize and use a min-max scaler for the feature values, and I couldn't see a transform that does this. Normalization is across row features, but I wanted to scale each feature value, and doing it on the whole data makes sense rather than on subsets.

Correct me if I'm wrong, and let me know if there is a way to transform a data object with features of varying size to standardize it after making the data object. Another reason for doing a yearly graph is to have node entities and indexes easily available.
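For what it's worth, this is the kind of per-feature scaling I mean, done once over the whole yearly graph after it's built (a sketch in plain torch rather than a PyG transform):

```python
import torch

# min-max scale each feature column to [0, 1], per node type
for node_type in data.node_types:
    x = data[node_type].x
    x_min = x.min(dim=0, keepdim=True).values
    x_max = x.max(dim=0, keepdim=True).values
    data[node_type].x = (x - x_min) / (x_max - x_min).clamp(min=1e-12)
```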
This is how I've constructed my data. Let me know if I'm missing any methods.
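The template roughly follows the standard InMemoryDataset pattern (a simplified skeleton, not my exact class; the raw file names and the build_hetero_graph() helper are placeholders):

```python
import torch
from torch_geometric.data import InMemoryDataset

class YearlyGraphDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super().__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ['nodes.csv', 'edges.csv']      # placeholder csv names

    @property
    def processed_file_names(self):
        return ['data.pt']

    def process(self):
        # build_hetero_graph() is a placeholder that reads the csvs and
        # returns one big HeteroData object for the whole year
        data = build_hetero_graph(self.raw_paths)
        if self.pre_transform is not None:
            data = self.pre_transform(data)
        data, slices = self.collate([data])
        torch.save((data, slices), self.processed_paths[0])
```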
Please do let me know if you require any additional details or would like to discuss. In the meantime, I will be testing different solutions to see if I can make it work temporarily with the hacks.