Encountered a CUDA error and edge index error #6672

amalislam675 · 2023-02-11T04:51:43Z

I am using PyG to train my model and I am hitting the error below. Is there a fix?

RuntimeError Traceback (most recent call last)
//user1/.conda/envs/PY37_1/lib/python3.7/site-packages/torch_geometric/nn/conv/message_passing.py in lift(self, src, edge_index, dim)
238 index = edge_index[dim]
--> 239 return src.index_select(self.node_dim, index)
240 except (IndexError, RuntimeError) as e:

RuntimeError: CUDA out of memory. Tried to allocate 25.04 GiB (GPU 0; 23.65 GiB total capacity; 1.73 GiB already allocated; 21.06 GiB free; 1.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

During handling of the above exception, another exception occurred:

ValueError Traceback (most recent call last)
/tmp/ipykernel_90168/3092512400.py in
----> 1 model.forward(train_data["feature1"], train_data["edge_index1"], train_data["edge_weight1"])

/tmp/ipykernel_90168/1301479275.py in forward(self, feature1, edge_index1, edge_weight1)
23 x = F.elu(x)
24 x = F.dropout(x, p=0.6, training=self.training)
---> 25 x = self.conv2(x, edge_index1, edge_weight1)
26
27 return F.log_softmax(x, dim=1)

//user1/.conda/envs/PY37_1/lib/python3.7/site-packages/torch/nn/modules/module.py in _call_impl(self, *input, **kwargs)
1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1129 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130 return forward_call(*input, **kwargs)
1131 # Do not call functions when jit is used
1132 full_backward_hooks, non_full_backward_hooks = [], []

/tmp/ipykernel_90168/3881736389.py in forward(self, x, edge_index, edge_attr, return_attention_weights)
130 # propagate_type: (x: PairTensor, edge_attr: OptTensor)
131 out = self.propagate(edge_index, x=(x_l, x_r), edge_attr=edge_attr,
--> 132 size=None)
133
134 alpha = self._alpha

//user1/.conda/envs/PY37_1/lib/python3.7/site-packages/torch_geometric/nn/conv/message_passing.py in propagate(self, edge_index, size, **kwargs)
428
429 coll_dict = self.collect(self.user_args, edge_index,
--> 430 size, kwargs)
431
432 msg_kwargs = self.inspector.distribute('message', coll_dict)

//user1/.conda/envs/PY37_1/lib/python3.7/site-packages/torch_geometric/nn/conv/message_passing.py in collect(self, args, edge_index, size, kwargs)
299 if isinstance(data, Tensor):
300 self.set_size(size, dim, data)
--> 301 data = self.lift(data, edge_index, dim)
302
303 out[arg] = data

//user1/.conda/envs/PY37_1/lib/python3.7/site-packages/torch_geometric/nn/conv/message_passing.py in lift(self, src, edge_index, dim)
241 if 'CUDA' in str(e):
242 raise ValueError(
--> 243 f"Encountered a CUDA error. Please ensure that all "
244 f"indices in 'edge_index' point to valid indices "
245 f"in the interval [0, {src.size(self.node_dim)}) in "

ValueError: Encountered a CUDA error. Please ensure that all indices in 'edge_index' point to valid indices in the interval [0, 128) in your node feature matrix and try again.

rusty1s · 2023-02-11T20:39:37Z

Maintainer

Looks like you are experiencing a CUDA OOM error. How big is your data? :)

6 replies

rusty1s Feb 12, 2023
Maintainer

Hm, that's actually not that large. I wonder why it tries to create a 24 GB tensor. What does your model look like? Does data.validate() run through?
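For context on where a tensor of that size could come from: the traceback fails inside lift, at src.index_select(self.node_dim, index), and the result of that call has one row per edge. A rough sketch of the arithmetic follows, assuming float32 features and per-node features of shape [heads, channels]. The helper name and the numbers passed to it are hypothetical and not taken from the model in this thread.

```python
def gathered_tensor_gib(num_edges, heads, channels, bytes_per_elem=4):
    """Approximate size in GiB of a [num_edges, heads, channels] tensor.

    bytes_per_elem=4 assumes float32 features (an assumption, not
    something stated in the thread).
    """
    return num_edges * heads * channels * bytes_per_elem / 2**30


# Hypothetical graph: 1,000,000 edges, 8 heads, 64 channels per head.
size_gib = gathered_tensor_gib(1_000_000, 8, 64)
print(f"{size_gib:.2f} GiB")
```

The size grows linearly with the number of edges, so an edge_index with far more columns than expected inflates this allocation directly.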

amalislam675 Feb 12, 2023
Author

I converted all of the data into the types PyTorch Geometric accepts: x (features) into a FloatTensor, edge_index (edge indices of the adjacency matrix) into a LongTensor, and edge_weights (edge values of the adjacency matrix). When I pass the data to the model object after creating the model, it raises the error. It does not even complete one training epoch. Basically, the error occurs in the forward function as soon as it starts running.

rusty1s Feb 13, 2023
Maintainer

Yes, that sounds good. Can you still run data.validate() on your Data object? My current assumption is that edge_index may hold very large/invalid indices, which may explain the OOM.
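The kind of bounds check being suggested here can also be sketched by hand. The snippet below is a plain-Python illustration, not the implementation of data.validate(). The helper name check_edge_index is hypothetical, and it assumes edge_index is given as two equal-length rows of integers (for a tensor, something like edge_index.tolist() would produce that).

```python
def check_edge_index(edge_index, num_nodes):
    """Return a list of problems found in a 2 x E edge list (empty if none)."""
    if len(edge_index) != 2:
        return [f"expected 2 rows, got {len(edge_index)}"]

    src, dst = edge_index
    problems = []
    if len(src) != len(dst):
        problems.append(f"row lengths differ: {len(src)} vs {len(dst)}")

    values = list(src) + list(dst)
    if not values:
        return problems  # an empty edge list has no indices to check

    lowest, highest = min(values), max(values)
    if lowest < 0:
        problems.append(f"negative index {lowest}")
    if highest >= num_nodes:
        problems.append(f"index {highest} out of range for {num_nodes} nodes")
    return problems


# Hypothetical example: 128 nodes, so valid indices lie in [0, 128).
print(check_edge_index([[0, 1, 127], [1, 2, 128]], num_nodes=128))
```

An empty result means every index lies in [0, num_nodes), which is the interval named in the ValueError above.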

amalislam675 Feb 13, 2023
Author

I will try running data.validate().

amalislam675 Feb 22, 2023
Author

It works fine now after switching to a different GPU. The problem was with my GPU: its memory was full.
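When reading an OOM message like the one in the traceback, it can help to compare the requested size with the device's total capacity and its free memory. The sketch below pulls those figures out with regular expressions. The helper name parse_oom_gib is hypothetical, and the patterns assume the exact wording and GiB units quoted in this thread; other PyTorch versions may phrase the message differently.

```python
import re


def parse_oom_gib(message):
    """Extract the GiB figures from a CUDA OOM message worded as in this thread."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) GiB",
        "capacity": r"([\d.]+) GiB total capacity",
        "allocated": r"([\d.]+) GiB already allocated",
        "free": r"([\d.]+) GiB free",
    }
    figures = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            figures[name] = float(match.group(1))
    return figures


# Message text copied from the traceback in the original post.
msg = (
    "CUDA out of memory. Tried to allocate 25.04 GiB (GPU 0; 23.65 GiB total "
    "capacity; 1.73 GiB already allocated; 21.06 GiB free; 1.84 GiB reserved "
    "in total by PyTorch)"
)
info = parse_oom_gib(msg)
fits_on_empty_gpu = info["requested"] <= info["capacity"]
print(info, fits_on_empty_gpu)
```

For the message in this thread, the 25.04 GiB request is larger than the 23.65 GiB total capacity of that device, so the size of the request is worth checking alongside how much memory is already in use.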
