Using Node LLama CPP in a Child Process #481
-
Hello! I need to load a model in a child process. Since I have been having issues with that, I stripped my code down to literally just the import:

```js
import {
  getLlama,
  LlamaChatSession,
  Llama,
  LlamaModel,
  LlamaContext,
} from "node-llama-cpp";
```

This is the only thing in the `child.js` file, which is being spawned like this:

```js
this._child = spawn("node", [CHILD_PATH], {
  stdio: ["inherit", "pipe", "pipe", "ipc"],
  detached: false,
});
```

where `CHILD_PATH` comes from:

```js
const CHILD_PATH = resolve(__dirname, "./child.js");
```

The child process hangs when I do this. I set it up to print a line before and after the import statement (it turns out imports are handled before `console.log` statements, so I put the import inside a lazy load), and what I see is that it prints the first line before the import but not the second line after the import. Can anyone shed some light on this? Why does it not work?

My goal: the reason I want it in a child process is that unloading a model programmatically has been a pain. I don't know how things have changed over the last few months, but a few months ago I concluded through GitHub forum research that the best way to unload a model is to load it inside a worker and just kill the thread. However, I opted to move to a child process instead of a worker thread, because the worker thread would freeze the main thread upon loading a model, whereas a child process shouldn't, given that child processes are non-blocking. So while I would love to get this working in a child process, I would love even more to simply have a way to not block the main process while loading, and then be able to unload a model instantly and in a guaranteed way!
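For illustration, here is a minimal sketch of what the lazy-load probe in `child.js` might look like; the marker messages and the IPC "ready" payload are assumptions for the example rather than my exact code, and it assumes the child runs as an ES module so top-level await works:

```js
// child.js — minimal probe: print a marker, lazily import node-llama-cpp, print another marker.
// If the second marker never shows up, the hang happens inside the import itself.
console.log("child: before importing node-llama-cpp");

const { getLlama } = await import("node-llama-cpp");

console.log("child: after importing node-llama-cpp");

// Report back over the IPC channel set up by the parent's spawn() call ("ipc" in stdio).
if (process.send) {
    process.send({ ready: true, hasGetLlama: typeof getLlama === "function" });
}
```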
-
Please share the output of running this command, so I can get a sense of your environment:

```bash
npx --yes node-llama-cpp inspect gpu
```

You can `await model.dispose()` to unload a model on demand, or just let it be garbage collected to get it unloaded automatically.

The loading of the model is also asynchronous, but it may take about ~100ms of the main thread to read the model's metadata from the JS side if the metadata is extremely big, but this only happens once.

Why did you land on using another process/thread for that?

On node-llama-cpp v2 the loading and unloading of models used to be sync, so having another process for that back then made sense, but this is not the case anymore since v3. Perh…
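To make that concrete, here is a minimal sketch of loading and unloading on the main process, assuming the v3 API; the model path is a placeholder:

```js
// Minimal sketch: load a model asynchronously on the main process and unload it on demand.
import { getLlama } from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({ modelPath: "path/to/model.gguf" }); // async, does not block the main thread

// ...use the model (create a context, run a chat session, etc.)...

await model.dispose(); // unload on demand; if you skip this, the model is unloaded when garbage collected
```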
-
Thanks for taking the time to run all these tests! The Vulkan build seems to fail since you don't have the Vulkan SDK installed. Which model did you use in these tests?

So I think there are 2 different issues here.

To help me with the first one, please run:

```bash
npx --no node-llama-cpp inspect measure --gpu cuda <gguf path>
```

It'd also help if you can run this command with other models you have that work without issues, so I can use them as a baseline.

Regarding the child process, I think you can safely discard that approach.

Thank you again for helping me debug this and get these issues solved :)
-
Oh yes, the Vulkan SDK! I forgot that Vulkan would also have an SDK; I just use CUDA. The model I used is Llama 3.1 8B, fine-tuned by me with that Unsloth Colab. Would that muddy the results? Here are the results of running that command with 3 separate models:

Llama 3.1 8B Q4_0

Phi 2 Q8_0

NPCAgent (my fine-tuned model off of llama3.1:8b)
-
As I am somewhat strapped for time, I think if I just use the main process without worker threads or a child process, as you said, it should work fine. I have learned a lot in this conversation, so I feel that the need for figuring out how to prevent the lag does not outweigh the immediate fixes that I can set up in my app. The context size thing might be a quick fix I can do (sketched after this post), although I know that each user's computer will be different! So it seems I can only help you with the first one.

Also, in my experience, as stated in the other tests, setting the context size limit made it so the model loaded without allocation errors and could be disposed of correctly. So maybe you can see about making it so that if a model does not allocate correctly it will still be disposable? I am talking about these messages, as they are the only indicator to my small little mind of when it loads a model incorrectly:

I am also curious about why there is a dispose function on the llama instance. Do I just call it like this?

```js
async unloadModel() {
    if (this.model != null) {
        await this.model.dispose();
        await this.llama.dispose();
    }
    this.llama = null;
    this.model = null;
    this.context = null;
    this.session = null;
}
```

Please feel free to message me directly on GitHub; perhaps we can even chat on Discord in the future!
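P.S. For completeness, the context size limit mentioned above is set when creating the context; a minimal sketch, assuming the v3 API (the path and the value 4096 are placeholders, not the exact numbers from my app):

```js
import { getLlama, LlamaChatSession } from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({ modelPath: "path/to/model.gguf" }); // placeholder path

// Capping the context size keeps the context's VRAM allocation small enough
// that creation succeeds (4096 is an arbitrary example value).
const context = await model.createContext({ contextSize: 4096 });
const session = new LlamaChatSession({ contextSequence: context.getSequence() });
```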
-
I see now that the memory usage estimation for a context with CUDA on your machine is off by too much, which seems to be the reason why it fails to get created without limiting the context size. The log messages you saw about CUDA actually come from the native code of llama.cpp.

Regarding the freeze you described, I think this happened since it tried to allocate all the memory that the GPU has and didn't leave space for anything else, which caused the system to try to unload things from the GPU and resulted in the freeze you experienced. Did the phi2 model also cause a freeze on your machine? Does it still happen if you disable mmap?

HMU on email so we can exchange Discord handles if you like!
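(The "padding" mentioned here presumably refers to leaving some VRAM headroom. A minimal sketch of how that and the mmap toggle might look; the option names `vramPadding` and `useMmap`, and the values used, are assumptions to verify against the docs rather than confirmed API:)

```js
// Minimal sketch, assuming the v3 options exist as named here (verify against the docs):
// - vramPadding on getLlama() to leave some VRAM unused as headroom
// - useMmap on loadModel() to toggle memory mapping of the model file
import { getLlama } from "node-llama-cpp";

const llama = await getLlama({
    vramPadding: 1024 * 1024 * 1024 // leave ~1GB of VRAM free (example value)
});
const model = await llama.loadModel({
    modelPath: "path/to/model.gguf", // placeholder path
    useMmap: false // test whether the freeze still happens without mmap
});
```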
-
Awesome, thank you for the insights! The phi2 model does freeze sometimes, but I just tried to replicate it and it didn't. I am still trying to figure out when and how it freezes, as it's seemingly random.

The padding thing sounds useful, thank you for that!

As for mmap with phi, here are the results:

DISABLED

ENABLED

I am not experiencing any lag for either of these as of right now. Also, yes, I will email you now, thank you!