
[BUG]: Memory leak(?) when using batching with large dataset (>450K items) #706

@sirisian

Description


What happened?

It seems that when letting PySR run indefinitely, after a while it hits an OOM error, but only when using a large dataset. I can watch its memory steadily grow.

I used the following setup in each of my tests, changing the sample_size value of 10000 to a specific value, or commenting out the three sampling lines to run on the whole dataset:

import numpy as np
import json
from pysr import PySRRegressor

with open('test2.json', 'r') as file:
    data = json.load(file)

data_array = np.array(data)

sample_size = min(10000, len(data_array))
random_indices = np.random.choice(len(data_array), size=sample_size, replace=False)
data_array = data_array[random_indices]

# Split the array into X and y
X = data_array[:, :5]
y = data_array[:, 5]

model = PySRRegressor(
    procs=14,
    populations=32,
    population_size=200,
    ncycles_per_iteration=1000,
    niterations=10000000,
    binary_operators=["-", "+", "*", "/", "^"],
    unary_operators=[
        "square",
    ],
    nested_constraints={
        "square": {"square": 1 }
    },
    constraints={
        "^": (-1, 8),
    },
    elementwise_loss="""
loss(prediction, target) = try 
    trunc(Int, prediction) == trunc(Int, target) ? 0 : (prediction - target)^2
catch
    (prediction - target)^2
end
""",
    maxsize=40,
    maxdepth=10,
    complexity_of_constants=2,
    batching=True,
    batch_size=200,
    heap_size_hint_in_bytes=200000000,
)

model.fit(X, y)
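For reference, this is roughly how I watch memory from the Python side without extra dependencies (a stdlib-only sketch using the resource module; note it only sees the current process's peak RSS, so the WSL-level numbers below capture more than this would):

```python
import resource

def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

# Example: snapshot before/after a large allocation
before = peak_rss_mb()
data = [0.0] * 10_000_000  # roughly 80 MB of Python floats
after = peak_rss_mb()
print(f"peak RSS grew by roughly {after - before:.0f} MB")
```

Calling peak_rss_mb() periodically during a long fit gives a simple growth log for the Python process itself.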

My dataset has 460K records, which I know isn't advised, but it's for a niche problem. The memory issue appears to happen only when running on a dataset over a certain size.

I saw a comment about heap_size_hint_in_bytes needing to be set, and I've played with values for it (1.2 GB and 200 MB, for instance), but it doesn't appear to change the behavior. Smaller batch sizes like 50 don't change it either. None of the other settings appear to change things: I've tried procs=8, smaller populations and population_size, and a smaller ncycles_per_iteration.

100K random records: WSL starts at 3.85 GB. At 10 minutes, 4 GB; at 50 minutes, 4.3 GB; at 1 hour, 3.2 GB; at 1.5 hours, 3 GB. No issues.

200K random records: WSL starts at ~4 GB. At 20 minutes, 4.2 GB; at 1 hour, 3.6 GB; at 2 hours 15 minutes, 3.5 GB. No issues.

300K random records: WSL starts at ~4.3 GB. At 20 minutes, 5.3 GB; grew to 5.9 GB, then dropped to 5.2 GB at 30 minutes; at 1 hour, 6 GB; at 1 hour 15 minutes, 5.4 GB; at 1 hour 30 minutes, 4.7 GB; at 1 hour 32 minutes, 4.4 GB. No issues.

400K random records: WSL starts at ~4.2 GB. At 3 minutes, 4.5 GB; at 8 hours 30 minutes, 7.7 GB.

460K records: WSL starts at ~4.8 GB. At 2 minutes, 5.2 GB; at 30 minutes, 9.6 GB; at 40 minutes, 12.3 GB; at 55 minutes, 14.5 GB; at 1 hour, 15.1 GB; at 1 hour 10 minutes, 15.4 GB; at 1 hour 19 minutes, 15.5 GB; at 1 hour 26 minutes, OOM. I also ran this using:

sample_size = min(460309, len(data_array))

just to be sure, and it failed at 1 hour 23 minutes, so no difference.
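As a possible workaround while this is being debugged, one could fit on rotating random subsets instead of the full 460K rows, using PySR's warm_start=True so equations carry over between fit calls. A stdlib-only sketch of the subsampling step (the rows data here is synthetic; it mirrors the np.random.choice call in my script):

```python
import random

def random_subset(data, sample_size, seed=None):
    # Draw sample_size distinct rows without replacement,
    # like np.random.choice(len(data), size=sample_size, replace=False)
    rng = random.Random(seed)
    return [data[i] for i in rng.sample(range(len(data)), sample_size)]

rows = [[float(i)] * 6 for i in range(1000)]  # synthetic stand-in for data_array
subset = random_subset(rows, 100, seed=0)
print(len(subset))  # 100
```

Each outer iteration would draw a fresh subset, split it into X and y, and call model.fit again with warm_start enabled; this keeps the working set small regardless of the full dataset size.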

I've attached my test2.json file.
test2.json

Version

0.19.4

Operating System

Linux

Package Manager

pip

Interface

IPython Terminal

Relevant log output

No response

Extra Info

No response

Labels

bug: Something isn't working
