What happened?
It seems like when letting PySR run indefinitely, it eventually hits an OOM error, but only when using a large dataset. I can watch memory steadily grow.
I used the following setup in each of my tests, either changing the sample_size value of 10000 to a specific number or commenting out the three sampling lines to run on the whole dataset:
```python
import numpy as np
import json
from pysr import PySRRegressor

with open('test2.json', 'r') as file:
    data = json.load(file)

data_array = np.array(data)

sample_size = min(10000, len(data_array))
random_indices = np.random.choice(len(data_array), size=sample_size, replace=False)
data_array = data_array[random_indices]

# Split the array into X and y
X = data_array[:, :5]
y = data_array[:, 5]

model = PySRRegressor(
    procs=14,
    populations=32,
    population_size=200,
    ncycles_per_iteration=1000,
    niterations=10000000,
    binary_operators=["-", "+", "*", "/", "^"],
    unary_operators=[
        "square",
    ],
    nested_constraints={
        "square": {"square": 1},
    },
    constraints={
        "^": (-1, 8),
    },
    elementwise_loss="""
    loss(prediction, target) = try
        trunc(Int, prediction) == trunc(Int, target) ? 0 : (prediction - target)^2
    catch
        (prediction - target)^2
    end
    """,
    maxsize=40,
    maxdepth=10,
    complexity_of_constants=2,
    batching=True,
    batch_size=200,
    heap_size_hint_in_bytes=200000000,
)

model.fit(X, y)
```
My dataset has 460K records, which I know isn't advised, but it's for a niche problem. The memory issue appears to happen only when running on a dataset over a certain size.
I saw a comment about heap_size_hint_in_bytes needing to be set and I've played with values for it, but it doesn't appear to change the behavior; I've set it to 1.2 GB and also to 200 MB, for instance. I've tried smaller batch sizes like 50, and that doesn't change the behavior either. None of the other settings appear to change things: I've tried procs=8, smaller populations and population_size, and a smaller ncycles_per_iteration (a sketch of the kind of reduced configuration I tried is shown below).
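For reference, this is roughly the shape of the lower-memory configuration I experimented with. The exact combinations varied between runs, and the specific values below are just illustrative examples drawn from the ranges mentioned above:

```python
# Illustrative only: an example of the reduced settings I tried; the exact
# combinations varied, and none of them changed the memory growth.
model = PySRRegressor(
    procs=8,                                # down from 14
    populations=16,                         # smaller than 32 (illustrative value)
    population_size=100,                    # smaller than 200 (illustrative value)
    ncycles_per_iteration=500,              # smaller than 1000 (illustrative value)
    batching=True,
    batch_size=50,                          # down from 200
    heap_size_hint_in_bytes=1_200_000_000,  # also tried 200_000_000 (~200 MB)
    # ... remaining operators, constraints, and loss as in the script above ...
)
model.fit(X, y)
```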
- 100K random records: WSL starts at 3.85 GB. At 10 minutes, 4 GB. At 50 minutes, 4.3 GB. At 1 hour, 3.2 GB. At 1.5 hours, 3 GB. No issues.
- 200K random records: WSL starts at ~4 GB. At 20 minutes, 4.2 GB. At 1 hour, 3.6 GB. At 2 hours 15 minutes, 3.5 GB. No issues.
- 300K random records: WSL starts at ~4.3 GB. At 20 minutes, 5.3 GB. Grew to 5.9 GB, then dropped to 5.2 GB at 30 minutes. At 1 hour, 6 GB. At 1 hour 15 minutes, 5.4 GB. At 1 hour 30 minutes, 4.7 GB. At 1 hour 32 minutes, 4.4 GB. No issues.
- 400K random records: WSL starts at ~4.2 GB. At 3 minutes, 4.5 GB. At 8 hours 30 minutes, 7.7 GB.
- 460K records (full dataset): WSL starts at ~4.8 GB. At 2 minutes, 5.2 GB. At 30 minutes, 9.6 GB. At 40 minutes, 12.3 GB. At 55 minutes, 14.5 GB. At 1 hour, 15.1 GB. At 1 hour 10 minutes, 15.4 GB. At 1 hour 19 minutes, 15.5 GB. At 1 hour 26 minutes, OOM.

I also re-ran the 460K case using

```python
sample_size = min(460309, len(data_array))
```

just to be sure, and it failed at 1 hour 23 minutes, so no difference.
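In case it helps reproduce the growth curve, a minimal sketch like the one below could log resident memory over time. Note this is not how I collected the numbers above (those are WSL's overall memory usage as shown in Windows); it assumes psutil is installed, uses a hypothetical helper name, and only tracks the Python process plus its child processes.

```python
# Minimal sketch (assumes psutil is available): periodically log the resident
# memory of the PySR process and its children (e.g. Julia workers).
import time
import psutil

def log_memory(pid, interval_s=60):
    proc = psutil.Process(pid)
    start = time.time()
    while proc.is_running():
        # Sum RSS over the main process and all of its child processes.
        rss = proc.memory_info().rss
        for child in proc.children(recursive=True):
            rss += child.memory_info().rss
        elapsed_min = (time.time() - start) / 60
        print(f"{elapsed_min:7.1f} min: {rss / 1e9:.2f} GB")
        time.sleep(interval_s)

# Example: run from a separate terminal with the PID of the training process.
# log_memory(12345)
```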
I've attached my test2.json file.
test2.json
Version
0.19.4
Operating System
Linux
Package Manager
pip
Interface
IPython Terminal
Relevant log output
No response
Extra Info
No response