-
Celery Queue Proposal

The current split into queues has its reasons, and changing that might cause issues. I started to write the reasoning here, but then I realized it fits better into the documentation, so it is in #15496. Your changes will make the observed performance much worse because they combine tasks that are not that important and can take long (like translation memory updates or automatic translation) with regular ones. There might be a better way to split the tasks (the queues have not been revisited for years), but your proposal makes things worse.

Prefetch-Multiplier

I don't think this will make a visible impact, but it can probably be lowered for …

OOM Kills

The root cause needs to be addressed here. We've considered …

Pool Type

I believe that thread-based concurrency is the way to go. Actually, our Docker container has used this since the 5.12 release (WeblateOrg/docker#3371). We should probably update the main documentation and examples to be consistent with that. We've experimented with eventlet as well, but some of our dependencies are not safe in that environment.

The Rescue: AutoScale
If your Weblate instance is not used much, this can be observed, but it is not generally true. Autoscaling can make your system behave badly under higher load because it spawns additional processes at exactly that time. This can also be another cause of OOM, because each of the newly spawned processes consumes additional memory. Overall, autoscaling will reduce memory consumption when the system is not busy, but there will be no difference when it is busy. If your Weblate server does not have much load, you can reduce the number of workers permanently; the provided configuration is just an example.
-
Let's get to the core before I respond to the smaller bits, and look at an example, right from the newly added documentation:
This is a perfect example to show why the current categorization does not make sense. There are currently 7 tasks which are executed on that queue.

Digest Sending

These tasks send e-mails which do not have any time constraints. Whether they are sent a few hours earlier or later does not matter at all. They can be sent nightly and can be executed on a queue with a single worker, together with many other tasks of that kind. This keeps the load for these things low and spread over a longer range of time instead of causing bursts at certain points in time.
Change Notifications

These three are about sending change notifications. Users expect such notifications to arrive in a timely manner, but if it sometimes takes a few minutes until they arrive, that is still fine.
E-Mails for direct User Interaction

Mails sent by this task are important and need to be sent as soon as possible.
You wrote:
"smooth delivery" is very important indeed - but that applies to just one of those tasks - not to the other six ones. It doesn't make sense to put all those tasks into the same queue just because they are all sending e-mails. Each task needs to be triaged and categorized individually - that's the core of my proposal and the things I'm saying here apply to other queues just in the same way. |
-
4 celery workers, all working in parallel, with all queues continuously filled? That's not a realistic case. I repeat: this NEVER happens... Hence, it is not a case to even put under consideration (except for trying to prevent the worst). A healthy server must always have room to "breathe". And when you look at live metrics for such systems, you can always see that the types of load are mixed and the mix is fluctuating: in one moment (e.g. a few seconds), it's this, and in the next moment it's that. The benefit of autoscale in this context is that one queue can get more workers in a moment when another queue needs less. But even without that, it is still beneficial. Two examples:

1. Optimizing Total Memory Consumption

You can stick to the counts you are currently using - but only as a maximum. If you are currently using 4 workers for a queue, you can set something like this:
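A minimal sketch of such a setting (the original snippet is not preserved here; the --autoscale=max,min syntax and the weblate.utils app name are assumptions to be adjusted to your setup):

```sh
# Keep today's 4 workers as the ceiling, but let Celery shrink down to 1 worker when idle.
celery --app=weblate.utils worker --queues=celery --autoscale=4,1
```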
This doesn't give you more than before - BUT: it still gives the server machine more memory for 30, 50, 80, 90% of the time (depending on your setup).

2. Using On-Demand Workers

When you have a queue that is just for those nightly scheduled tasks and non-time-critical tasks, you can do something like this:
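A sketch of the on-demand variant (the queue name is just an example, and it assumes your Celery version accepts a minimum of 0 workers; otherwise 1 is the floor):

```sh
# No worker is kept alive while idle; one is spawned on demand when a task arrives.
celery --app=weblate.utils worker --queues=backup --autoscale=1,0
```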
And once again you save the memory of another worker. The 2-5 s that it may take to start a worker don't really matter for many tasks.
That's true, you must not "overbook" the memory unless you have a cross-node autoscaler.
I'm afraid I cannot agree with that, because at certain moments a given queue will do better when it has more workers - but it doesn't need to have them pre-allocated 24/7, permanently keeping server RAM usage at the limit. That's not a good strategy IMHO.
-
Pool Type

Let's start with this:
From all I have read, this is nonsense and specifically unsuitable for Weblate.
There is no thread concurrency in Python execution... but...

Let's entertain the idea

With the Python GIL, it is impossible to run any actual Python code in parallel. This means: when a task spends most of its time just waiting (for a DB result to arrive, for completion of a pipeline/file read or write, for async execution of other tasks, etc.), threads work fine, because the GIL is released while waiting. But there are also other tasks which drive the CPU to 100% load. With a prefork pool and 4 workers (and 4 CPUs), all 4 workers can get to 100% = system at maximum load. When you use a threaded pool, each worker will only achieve 25% in parallel (= all at 25%) - or one worker 100% and the other three 0%.

Maybe a possibility

In this post, I'm categorizing tasks by their availability needs. Another categorization might be which tasks are CPU bound and which are IO bound, and then use a threaded pool for the latter and a prefork pool for the former.
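A rough sketch of that split, with hypothetical queue names and the weblate.utils app module assumed (this is not an existing Weblate setup, just an illustration of running two worker nodes with different pools):

```sh
# CPU-bound work: prefork, so tasks can actually use multiple cores in parallel.
celery --app=weblate.utils worker --queues=cpu-bound --pool=prefork --concurrency=2

# IO-bound work: threads give cheap concurrency while tasks mostly wait on DB/network.
celery --app=weblate.utils worker --queues=io-bound --pool=threads --concurrency=8
```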
Then you would also need to remove the worker_max_memory_per_child parameter, because it does not work with a threaded pool. And then I would read this again from the Celery docs:
This is about prefork pools. A threaded pool cannot restart its workers like a prefork pool can. And finally: the higher you set the prefetch multiplier, the more tasks will get lost when the process dies.
-
Like I said, I might have made a few mistakes in the categorization. It would be nice if you could point me to the ones where my triage was not right. Another possible improvement for long-running tasks might be to split them up into smaller parts, where each part queues the next part only when it is done, so that the queue doesn't get flooded with too many tasks at once.
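A minimal sketch of that chunking idea (task and helper names are hypothetical, not existing Weblate tasks):

```python
from celery import shared_task

CHUNK_SIZE = 100

@shared_task
def process_in_chunks(ids, offset=0):
    """Process one chunk and enqueue the next chunk only once this one is done."""
    for item_id in ids[offset:offset + CHUNK_SIZE]:
        handle_single_item(item_id)
    if offset + CHUNK_SIZE < len(ids):
        # Re-queue the continuation instead of flooding the queue with all parts upfront.
        process_in_chunks.delay(ids, offset + CHUNK_SIZE)


def handle_single_item(item_id):
    """Placeholder for the actual per-item work."""
```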
-
Part 1: Queue Organization
When looking at the process tree of Celery in a default configuration, there is a total of about 2 GB of RAM used by all the Celery processes.
It's not just the workers taking RAM: each queue also has a control process for its workers, and each of these control processes takes about 200 MB.
That means that for the 5 queues (celery, translate, notify, memory, backup) we already have 1 GB of RAM usage without even a single worker. These usually have a concurrency of at least 2, except backup, so we get 9 worker processes. Even without having done any work, these take about 160 MB each, roughly another 1.3 GB. As some memory is shared between parent and child (forked) processes, the effective total is a bit less than the sum.
(the most reliable verification is systemctl stop celery-weblate)

This is a lot of memory, and it brought me to wonder:
Does this really make sense?
The backup queue alone takes 350 MB of RAM - but for what? Running backups once a day is fine - but it usually takes just minutes - so do we need to spend 350 MB of RAM 24/7 just for this?
Looking at the other queues and analyzing the tasks that are executed through them brought me to a very clear answer:
This doesn't make sense at all.
Now, we should think about the following question:
What's the purpose of having multiple Celery queues?
Quite obviously, it is about preventing tasks of type A from being blocked by tasks of type B: when a lot of type B tasks are queued and there is just a single queue, the type A tasks might not get a chance to be executed in time - and that is really bad when type A tasks are far more important and urgent than type B tasks.
So having multiple queues does make a lot of sense. What is far from ideal, though, is the queue organization.
New Celery Queue Proposal
The goal of non-blocking tasks should be achievable with 3 queues instead of 5, as follows:
This is for running all those tasks which are directly related to UI operations in the web clients - meaning:
Handles tasks that are expected to run longer and not produce a result immediately, but that are still expected to be executed in a timely manner, with results becoming visible "soon".
It's also for all tasks which do not clearly belong to 1 or 3 - in this regard it's also meant to be the "default" queue.
Simply put: this is for all those tasks that nobody is actively waiting for.
For tasks that can happen at any time (of the day), like scheduled/interval tasks, where it doesn't matter exactly when they are executed, just that they are executed at some point.
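To make the idea concrete, here is a sketch of how such routing could be expressed in Django settings (queue and task names are hypothetical, and the usual namespace="CELERY" wiring is assumed; this is not the proposal's actual configuration):

```python
# Route tasks by urgency instead of by subsystem.
CELERY_TASK_ROUTES = {
    "myapp.tasks.send_digest": {"queue": "anytime"},   # no time constraint, e.g. nightly
    "myapp.tasks.update_index": {"queue": "soon"},     # longer running, but timely
}
# Everything without an explicit route lands on the default (interactive) queue.
CELERY_TASK_DEFAULT_QUEUE = "interactive"
```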
Classification
I have painstakingly worked through every single task that is registered with Celery: looked up where it is defined in the code and traced back all references to see from where and in which cases it is called (or scheduled).
I think I got it largely right, but there may well be cases where it is debatable or where my lack of deep knowledge of the project led me to a wrong conclusion. At least it should be a good starting point, though.
Part 2: Optimizing Celery Process Configuration
Primarily driven by the thought that "it can't be true that a task runner framework takes this much memory just idling, with such a small number of queues and workers", I did a lot of research and experimenting on this subject.
I'll share my findings below. Let me apologize in advance for probably mentioning things that are clear and common knowledge anyway - Python-Django-Celery is not really my home base.
Prefetch-Multiplier
The advantage of prefetching multiple tasks - i.e. a worker accepting multiple tasks at once - pays off when hundreds or thousands of tasks are processed per second. When a Redis fetch takes 1 ms and 1000 tasks are processed per second, the fetching adds another second. When fetching chunks of 10 tasks, there is only 100 ms of overhead.
But Weblate doesn't have that kind of workload. It is far away from such figures, and a setup with a non-local Redis doesn't make sense in the first place.
And there are disadvantages that weigh in:
Conclusion
My conclusion is that --prefetch-multiplier should always be set to 1 for Weblate.

Worker Memory
Here's an interesting excerpt from the Celery docs:
The worker_max_tasks_per_child option doesn't make much sense to me. The one and only concern is memory, and the current Weblate default of worker_max_memory_per_child=250000 (250 MB) is well chosen IMO. It is probably important to note that this won't kill a worker that exceeds that amount of memory while it is running; it just means the worker is shut down and respawned after completing its current task.
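For reference, a sketch of how both knobs appear as worker options (the weblate.utils app name is an assumption; the memory value mirrors the default discussed above):

```sh
# Fetch only one task at a time and recycle any worker that has grown past ~250 MB.
celery --app=weblate.utils worker --prefetch-multiplier=1 --max-memory-per-child=250000
```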
OOM Kills
That's a different story. I have already written about this here: #15427
In this context, I came across another setting:
WORKER_LOST_WAIT
The default is 10 s. It doesn't make sense IMO to wait that long for some result that the worker might have published - that is very unlikely, at least within the context of Weblate.
IMO, it is much more important not to remain inoperable for that long (e.g. with a single worker / concurrency=1).
Suggestion
Note
There should also be a discussion about handling failed (e.g. OOM-killed) tasks and how to remedy them. The question would be whether it makes sense to use TASK_ACKS_LATE and also TASK_REJECT_ON_WORKER_LOST.
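Purely as an illustration of where these knobs live (the values are not a recommendation from this thread, and the Django CELERY_ settings prefix is assumed):

```python
CELERY_WORKER_LOST_WAIT = 2.0              # don't wait the default 10 s for a dead worker's result
CELERY_TASK_ACKS_LATE = True               # acknowledge only after the task has finished
CELERY_TASK_REJECT_ON_WORKER_LOST = True   # re-queue tasks whose worker was killed (e.g. by the OOM killer)
```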
Pool Type
Celery can run multiple "nodes", where a "node" simply means something like a "manager process" for the workers. Its role changes depending on the pool type.
Values are
This is the default and simply means "multi-process".
It means that when the "node" has 4 workers configured, there will be 5 processes:
"Manager" Process
Worker Process 1
Worker Process 2
Worker Process 3
Worker Process 4
This is similar to above, but there's only a single process - each worker is run on a separate thread.
Now - it's quite tempting to use this pool type, because it takes a lot less memory in the first place.
There are severe drawbacks, though:
I didn't say there is no multi-threading: Python does in fact allow you to work with threads, manage them, and even execute code on multiple threads.
But there's a big BUT: none of that can happen in parallel - i.e. when Python code is executing in one thread, no other thread can execute Python code at the same time. The other threads have to wait until that thread hits some async IO or another call that releases the lock (while waiting), allowing another thread to execute Python code.
This is fatal, because the process is unable to keep its memory low. It cannot regain memory by killing a thread (the way it works for a child process).
As a consequence, the process memory will just grow over time.
This mode does not support anything but concurrency=1
The single worker is run on the main thread and the list of drawbacks is even longer.
Not a serious option - even though the lower memory is very tempting - admittedly.
But it doesn't pay off in the long run.
Conclusion
The Weblate tasks include quite a lot of CPU-bound work. Using threads for the pool mode isn't a good idea. There's a better way...
The Rescue: AutoScale
After lots of testing I came to ask the right question: Why do all those processes need to pre-exist?
So many processes: 2x notify, 2x memory, 2x translate, 2x celery, 1x backup, 1x heartbeat
It practically never happens that all workers are active at the same time.
So why should they always take all that memory at the same time? Eventually, I found autoscale - a Celery feature, albeit not a well-documented one. It goes like this:
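A sketch of the option as it might appear among the worker options in the celery-weblate environment file (the variable name CELERY_MAIN_OPTIONS follows the Weblate systemd example and may differ in your setup; --autoscale takes max,min):

```sh
CELERY_MAIN_OPTIONS="--autoscale=4,1"
```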
This means that Celery should dynamically adjust the number of workers for this node as follows:
There should always be at least one worker process running.
Depending on demand, it can create a maximum number of 4 workers.
Keepalive
Even less documented is another important setting:
This determines how long a worker is kept alive when there is nothing more for it to do.
The default is 30 seconds - I think 2 (or maybe 5) seconds are enough. We don't have bursts of tasks where it would be important to handle them instantly.
Anyway, we can keep one worker permanently alive.
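As far as I can tell, this keepalive is read from an environment variable by celery.worker.autoscale, so treat the exact mechanism as an assumption and verify it against your Celery version; a sketch for the environment file:

```sh
# Scale idle workers down after 5 seconds instead of the default 30.
AUTOSCALE_KEEPALIVE=5
```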
Putting it all together
What's nice here is that it doesn't even need a source code change to try it out...
The Configuration
settings.py
Celery systemd Config
The Results
The figures shown are from a state where:
(it's a 2-CPU VM with 4 GB memory)
Memory Usage
Before: Current Suggested Configuration
USED RAM: 2.32 GB
After: With the changes proposed above
USED RAM: 1.71 GB
The Math
After stopping the Celery services, used RAM went down to 1.05 GB and 0.98 GB, respectively.
Celery RAM Usage Current Setup: 1.27 GB
Celery RAM Usage Proposed: 0.73 GB
Pictures
Before
I had seen this all the time:
After
I had never seen this "congratulations" message ever before...