-
Celery Queue Proposal

The current split into queues has its reasons, and changing that might cause issues. I started to write the reasoning here, but then I realized it fits better into the documentation, so it is in #15496. Your changes will make the observed performance much worse because they combine tasks that are not that important and can take long (like translation memory updates or automatic translation) with regular ones. There might be a better way to split the tasks (the queues have not been revisited for years), but your proposal makes things worse.

Prefetch-Multiplier

I don't think this will make a visible impact, but it can probably be lowered for …

OOM Kills

The root cause needs to be addressed here. We've considered …

Pool Type

I believe that thread-based concurrency is the way to go. Actually, our Docker container has used this since the 5.12 release (WeblateOrg/docker#3371). We should probably update the main documentation and examples to be consistent with that. We've experimented with eventlet as well, but some of our dependencies are not safe in that environment.

The Rescue: AutoScale
If your Weblate instance is not used much, this can be observed, but it is not generally true. Autoscaling can make your system behave badly under higher load because it spawns additional processes at exactly that time. This can also be another cause of OOM, because each of the newly spawned processes consumes additional memory. Overall, autoscaling will reduce memory consumption when the system is not busy, but there will be no difference when it is busy. If your Weblate server does not have much load, you can reduce the number of workers permanently; the provided configuration is just an example.
-
Let's get to the core before I respond to the smaller bits, and look at an example, right from the newly added documentation:
This is a perfect example to show why the current categorization does not make sense. There are currently 7 tasks which are executed on that queue.

Digest Sending

These tasks send e-mails which do not have any time constraints. Whether they are sent a few hours earlier or later does not matter at all. They can be sent nightly and can be executed on a queue with a single worker, together with many other tasks of that kind. This keeps the load for these things low and spread over a longer range of time instead of causing bursts at certain points in time.
Change Notifications

These three are about sending change notifications. Users expect such notifications to arrive in a timely manner, but if it sometimes takes a few minutes until they arrive, that is still fine.
E-Mails for direct User Interaction

Mails sent by this task are important and need to be sent as soon as possible.
You wrote:
"smooth delivery" is very important indeed - but that applies to just one of those tasks - not to the other six ones. It doesn't make sense to put all those tasks into the same queue just because they are all sending e-mails. Each task needs to be triaged and categorized individually - that's the core of my proposal and the things I'm saying here apply to other queues just in the same way. |
-
4 celery workers, all working in parallel, with all queues continuously filled? That's not a realistic case. I repeat: this NEVER happens... Hence, it is not a case to even put under consideration (except for trying to prevent the worst). A healthy server must always have room to "breathe". And when you look at live metrics for such systems, you can always see that the types of load are mixed and the mix is fluctuating: in one moment (e.g. a few seconds), it's this, and in the next moment it's that. The benefit of autoscale in this context is that one queue can get more workers in a moment when another queue needs less. But even without that, it is still beneficial. Two examples:

1. Optimizing Total Memory Consumption

You can stick to the counts you are currently using - but only as a maximum. If you are currently using 4 workers for a queue, you can set something like this:
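A minimal sketch of such a setting (the original snippet is not preserved here; the --autoscale=max,min syntax and the weblate.utils app name are assumptions to be adjusted to your setup):

```sh
# Keep today's 4 workers as the ceiling, but let Celery shrink down to 1 worker when idle.
celery --app=weblate.utils worker --queues=celery --autoscale=4,1
```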
This doesn't give you more than before - BUT: it still gives the server machine more memory for 30, 50, 80, 90% of the time (depending on your setup).

2. Using On-Demand Workers

When you have a queue that is just for those nightly scheduled tasks and non-time-critical tasks, you can do something like this:
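A sketch of the on-demand variant (the queue name is just an example, and it assumes your Celery version accepts a minimum of 0 workers; otherwise 1 is the floor):

```sh
# No worker is kept alive while idle; one is spawned on demand when a task arrives.
celery --app=weblate.utils worker --queues=backup --autoscale=1,0
```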
And once again you save the memory of another worker. The 2-5 s that it may take to start a worker don't really matter for many tasks.
That's true, you must not "overbook" the memory unless you have a cross-node autoscaler.
I'm afraid I cannot agree with that, because at certain moments a given queue will do better when it has more workers - but it doesn't need to have them pre-allocated 24/7, permanently keeping server RAM usage at the limit. That's not a good strategy IMHO.
-
Pool Type

Let's start with this:
From all I have read, this is nonsense and specifically unsuitable for Weblate.
There is no thread concurrency in Python execution... but...

Let's entertain the idea

With the Python GIL, it is impossible to run any actual Python code in parallel. This means: when a task spends most of its time just waiting (for a DB result to arrive, for completion of a pipeline/file read or write, for async execution of other tasks, etc.), threads work fine, because the GIL is released while waiting. But there are also other tasks which drive the CPU to 100% load. With a prefork pool and 4 workers (and 4 CPUs), all 4 workers can get to 100% = system at maximum load. When you use a threaded pool, each worker will only achieve 25% in parallel (= all at 25%) - or one worker 100% and the other three 0%.

Maybe a possibility

In this post, I'm categorizing tasks by their availability needs. Another categorization might be which tasks are CPU bound and which are IO bound, and then use a threaded pool for the latter and a prefork pool for the former.
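A rough sketch of that split, with hypothetical queue names and the weblate.utils app module assumed (this is not an existing Weblate setup, just an illustration of running two worker nodes with different pools):

```sh
# CPU-bound work: prefork, so tasks can actually use multiple cores in parallel.
celery --app=weblate.utils worker --queues=cpu-bound --pool=prefork --concurrency=2

# IO-bound work: threads give cheap concurrency while tasks mostly wait on DB/network.
celery --app=weblate.utils worker --queues=io-bound --pool=threads --concurrency=8
```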
Then you would also need to remove the worker_max_memory_per_child parameter, because it does not work with a threaded pool. And then I would read this again from the Celery docs:
This is about prefork pools. A threaded pool cannot restart its workers like a prefork pool can. And finally: the higher you set the prefetch multiplier, the more tasks will get lost when the process dies.
-
Like I said, I might have made a few mistakes in the categorization. It would be nice if you could point me to the ones where my triage was not right. Another possible improvement for long-running tasks might be to split them up into smaller parts, where each part queues the next part only when it is done, so that the queue doesn't get flooded with too many tasks at once.
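A minimal sketch of that chunking idea (task and helper names are hypothetical, not existing Weblate tasks):

```python
from celery import shared_task

CHUNK_SIZE = 100

@shared_task
def process_in_chunks(ids, offset=0):
    """Process one chunk and enqueue the next chunk only once this one is done."""
    for item_id in ids[offset:offset + CHUNK_SIZE]:
        handle_single_item(item_id)
    if offset + CHUNK_SIZE < len(ids):
        # Re-queue the continuation instead of flooding the queue with all parts upfront.
        process_in_chunks.delay(ids, offset + CHUNK_SIZE)


def handle_single_item(item_id):
    """Placeholder for the actual per-item work."""
```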
-
Part 1: Queue Organization
When looking at the process tree of Celery in a default configuration, there is a total of about 2 GB of RAM used by all the Celery processes.
It's not just the workers taking RAM: each queue also has a control process for its workers, and each of these control processes takes about 200 MB.
That means that for the 5 queues (celery, translate, notify, memory, backup) we already have 1 GB of RAM usage without even a single worker. These usually have a concurrency of at least 2, except backup, so we get 9 worker processes. Even without having done any work, these take about 160 MB each, roughly another 1.3 GB. As some memory is shared between parent and child (forked) processes, the effective total is a bit less than the sum.
(the most reliable verification is systemctl stop celery-weblate)

This is a lot of memory, and it brought me to wonder:
Does this really make sense?
The backup queue alone takes 350 MB of RAM - but for what? Running backups once a day is fine - but it usually takes just minutes - so do we need to spend 350 MB of RAM 24/7 just for this?
Looking at the other queues and analyzing the tasks that are executed through them brought me to a very clear answer:
This doesn't make sense at all.
Now, we should think about the following question:
What's the purpose of having multiple Celery queues?
Quite obviously, it is about preventing tasks of type A from being blocked by tasks of type B: when a lot of type B tasks are queued and there is just a single queue, the type A tasks might not get a chance to be executed in time - and that is really bad when type A tasks are far more important and urgent than type B tasks.
So having multiple queues does make a lot of sense. What is far from ideal, though, is the queue organization.
New Celery Queue Proposal
The goal of non-blocking tasks should be achievable with 3 queues instead of 5, as follows:
This is for running all those tasks which are directly related to UI operations in the web clients - meaning:
Handles tasks that are expected to run longer and not produce a result immediately, but that are still expected to be executed in a timely manner, with results becoming visible "soon".
It's also for all tasks which do not clearly belong to 1 or 3 - in this regard it's also meant to be the "default" queue.
Simply put: this is for all those tasks that nobody is actively waiting for.
For tasks that can happen at any time (of the day), like scheduled/interval tasks, where it doesn't matter exactly when they are executed, just that they are executed at some point.
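To make the idea concrete, here is a sketch of how such routing could be expressed in Django settings (queue and task names are hypothetical, and the usual namespace="CELERY" wiring is assumed; this is not the proposal's actual configuration):

```python
# Route tasks by urgency instead of by subsystem.
CELERY_TASK_ROUTES = {
    "myapp.tasks.send_digest": {"queue": "anytime"},   # no time constraint, e.g. nightly
    "myapp.tasks.update_index": {"queue": "soon"},     # longer running, but timely
}
# Everything without an explicit route lands on the default (interactive) queue.
CELERY_TASK_DEFAULT_QUEUE = "interactive"
```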
Classification
I have painstakingly worked through every single task that is registered with Celery: looked up where it is defined in the code and traced back all references to see from where and in which cases it is called (or scheduled).
I think I got it largely right, but there may well be cases where it is debatable or where my lack of deep knowledge of the project led me to a wrong conclusion. At least it should be a good starting point, though.
Part 2: Optimizing Celery Process Configuration
Primarily driven by the thought that "it can't be true that a task runner framework takes this much memory just idling, with such a small number of queues and workers", I did a lot of research and experimenting on this subject.
I'll share my findings below. Let me apologize in advance for probably mentioning things that are clear and common knowledge anyway - Python-Django-Celery is not really my home base.
Prefetch-Multiplier
The advantage of prefetching multiple tasks - i.e. a worker accepting multiple tasks at once - pays off when hundreds or thousands of tasks are processed per second. When a Redis fetch takes 1 ms and 1000 tasks are processed per second, the fetching adds another second. When fetching chunks of 10 tasks, there is only 100 ms of overhead.
But Weblate doesn't have that kind of workload. It is far away from such figures, and a setup with a non-local Redis doesn't make sense in the first place.
And there are disadvantages that weigh in:
Conclusion
My conclusion is that --prefetch-multiplier should always be set to 1 for Weblate.

Worker Memory
Here's an interesting excerpt from the Celery docs:
The worker_max_tasks_per_child option doesn't make much sense to me. The one and only concern is memory, and the current Weblate default of worker_max_memory_per_child=250000 (250 MB) is well chosen IMO. It is probably important to note that this won't kill a worker that exceeds that amount of memory while it is running; it just means the worker is shut down and respawned after completing its current task.
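For reference, a sketch of how both knobs appear as worker options (the weblate.utils app name is an assumption; the memory value mirrors the default discussed above):

```sh
# Fetch only one task at a time and recycle any worker that has grown past ~250 MB.
celery --app=weblate.utils worker --prefetch-multiplier=1 --max-memory-per-child=250000
```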
OOM Kills
That's a different story. I have already written about this here: #15427
In this context, I came across another setting:
WORKER_LOST_WAIT
The default is 10 s. It doesn't make sense IMO to wait that long for some result that the worker might have published - that is very unlikely, at least within the context of Weblate.
IMO, it is much more important not to remain inoperable for that long (e.g. with a single worker / concurrency=1).
Suggestion
Note
There should also be a discussion about handling failed (e.g. OOM-killed) tasks and how to remedy them. The question would be whether it makes sense to use TASK_ACKS_LATE and also TASK_REJECT_ON_WORKER_LOST.
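Purely as an illustration of where these knobs live (the values are not a recommendation from this thread, and the Django CELERY_ settings prefix is assumed):

```python
CELERY_WORKER_LOST_WAIT = 2.0              # don't wait the default 10 s for a dead worker's result
CELERY_TASK_ACKS_LATE = True               # acknowledge only after the task has finished
CELERY_TASK_REJECT_ON_WORKER_LOST = True   # re-queue tasks whose worker was killed (e.g. by the OOM killer)
```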
Pool Type
Celery can run multiple "nodes", where a "node" simply means something like a "manager process" for the workers. Its role changes depending on the pool type.
Values are
This is the default and simply means "multi-process".
It means that when the "node" has 4 workers configured, there will be 5 processes:
"Manager" Process
Worker Process 1
Worker Process 2
Worker Process 3
Worker Process 4
This is similar to above, but there's only a single process - each worker is run on a separate thread.
Now - it's quite tempting to use this pool type, because it takes a lot less memory in the first place.
There are severe drawbacks, though:
I didn't say there is no multi-threading: Python does in fact allow you to work with threads, manage them, and even execute code on multiple threads.
But there's a big BUT: none of that can happen in parallel - i.e. when Python code is executing in one thread, no other thread can execute Python code at the same time. The other threads have to wait until that thread hits some async IO or another call that releases the lock (while waiting), allowing another thread to execute Python code.
This is fatal, because the process is unable to keep its memory low. It cannot regain memory by killing a thread (the way it works for a child process).
As a consequence, the process memory will just grow over time.
This mode does not support anything but concurrency=1
The single worker is run on the main thread and the list of drawbacks is even longer.
Not a serious option - even though the lower memory is very tempting - admittedly.
But it doesn't pay off in the long run.
Conclusion
The Weblate tasks include quite a lot of CPU-bound work. Using threads for the pool mode isn't a good idea. There's a better way...
The Rescue: AutoScale
After lots of testing I came to ask the right question: Why do all those processes need to pre-exist?
So many processes: 2x notify, 2x memory, 2x translate, 2x celery, 1x backup, 1x heartbeat
It practically never happens that all workers are active at the same time.
So why should they always take all that memory at the same time? Eventually, I found autoscale - a Celery feature, albeit not a well-documented one. It goes like this:
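A sketch of the option as it might appear among the worker options in the celery-weblate environment file (the variable name CELERY_MAIN_OPTIONS follows the Weblate systemd example and may differ in your setup; --autoscale takes max,min):

```sh
CELERY_MAIN_OPTIONS="--autoscale=4,1"
```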
This means that Celery should dynamically adjust the number of workers for this node as follows:
There should always be at least one worker process running.
Depending on demand, it can create a maximum number of 4 workers.
Keepalive
Even less documented is another important setting:
This determines how long a worker is kept alive when there is nothing more for it to do.
The default is 30 seconds - I think 2 (or maybe 5) seconds are enough. We don't have bursts of tasks where it would be important to handle them instantly.
Anyway, we can keep one worker permanently alive.
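As far as I can tell, this keepalive is read from an environment variable by celery.worker.autoscale, so treat the exact mechanism as an assumption and verify it against your Celery version; a sketch for the environment file:

```sh
# Scale idle workers down after 5 seconds instead of the default 30.
AUTOSCALE_KEEPALIVE=5
```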
Putting it all together
What's nice here is that it doesn't even need a source code change to try it out...
The Configuration
settings.py
Celery systemd Config
The Results
The figures shown are from a state where:
(it's a 2-CPU VM with 4 GB memory)
Memory Usage
Before: Current Suggested Configuration
USED RAM: 2.32 GB
After: With the changes proposed above
USED RAM: 1.71 GB
The Math
After stopping the Celery services, used RAM went down to 1.05 GB and 0.98 GB, respectively.
Celery RAM Usage Current Setup: 1.27 GB
Celery RAM Usage Proposed: 0.73 GB
Pictures
Before
I had seen this all the time:
After
I had never seen this "congratulations" message ever before...