Hi @js2702, thanks for the feedback! I'll try to give some insight into the design decisions leading to the current architecture.
The `id` column name is currently hardcoded, so there isn't a direct way to configure it per table. A smaller feature that might work for your use case is if we add support for "column aliases" on the client. That way you could sync the data using the standard `id` column, and alias it back to your prefixed name (e.g. `f_id`) on the client.
We currently store the data as raw JSON strings, since that is what the client works with, and that requires the least amount of parsing and re-serialization throughout the sync process. For the most part the data is kept as a string all the way from bucket storage on the service to storage on the client, until the data is queried on the client - we're not even doing JSON parsing on it when syncing. We are considering adding BSON as an option - this is what MongoDB uses for "objects". The main advantage would be for supporting binary data though, rather than reducing storage size or parsing overhead.
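To make that concrete, a single sync operation looks roughly like the sketch below. The field names are approximate rather than the exact wire/storage format; the point is that `data` stays a raw JSON string.

```ts
// Rough sketch of a single sync operation as stored/transferred.
// Field names here are approximate - the key point is that `data` is a
// JSON-encoded string, only parsed when the row is queried on the client.
const opEntry = {
  op_id: '100001',
  op: 'PUT',
  object_type: 'food',
  object_id: 'a1b2c3',
  data: '{"f_id":"a1b2c3","id":"a1b2c3","name":"Apple","created_at":"2024-01-15T12:34:56.789Z"}',
};
```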
The "column aliases" idea from above could actually also work here - that would allow you to alias to shorter column names in sync rules, and back to the full column names on the client.
We're using the ISO 8601 format by default since it's simple to inspect and work with. You can, however, convert to unix epoch in your sync rules - see the example here. Of course, the number is still stored inside a JSON string, so it will still take more than 8 bytes.
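For reference, a hedged sketch of what that could look like in sync rules. Table and column names are taken from the `food` example in this thread, and `unixepoch()` is an assumed function name based on the SQLite-style functions available in sync rules - check the linked example for the exact supported syntax.

```yaml
# Hedged sketch: convert a timestamp column to unix epoch in a sync rules
# data query. Table/column names are illustrative; unixepoch() is an assumed
# function name - see the linked example for the exact supported form.
bucket_definitions:
  by_user:
    parameters: SELECT request.user_id() AS user_id
    data:
      - SELECT f_id AS id, name, unixepoch(created_at) AS created_at FROM food WHERE owner_id = bucket.user_id
```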
Storing the data directly as "objects" in the MongoDB storage database won't reduce the storage overhead significantly - it would still include the column names for each row, and it would add overhead to convert to JSON on each sync request.

For the 17 GB you're currently seeing - is that the "storage size" (often compressed), or the data size (uncompressed)? With MongoDB you can enable snappy or zstd compression, which can significantly reduce storage size, especially for JSON data like this (there's a config sketch at the end of this comment).

Other notes

Note that you can also use Postgres for bucket storage as an alternative, if you prefer that over MongoDB. It will have most of the same storage size trade-offs though, and is currently not quite as optimized as our MongoDB implementation.

We are also planning on adding support for S3 or equivalent object storage, to offload the bulk data storage (with MongoDB/Postgres still used for the real-time updates). That could be significantly cheaper and scale better when working with large amounts of data. We don't have specific timelines for this yet though.
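On the compression point: a minimal mongod.conf sketch for enabling zstd block compression (this is standard MongoDB/WiredTiger configuration, not specific to PowerSync):

```yaml
# Standard mongod.conf snippet - applies zstd block compression to newly
# created collections (snappy is MongoDB's default). Existing collections
# keep whatever compressor they were created with.
storage:
  wiredTiger:
    collectionConfig:
      blockCompressor: zstd
```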
-
Hello, we are currently testing PowerSync through a private beta with some of our users. We have onboarded 3,800 enrolled users so far, and they are using ~17 GB of storage. Seeing as we have north of 300K users, we would like to find some way to optimize this storage.
As this was a migration from an existing app that was only using SQLite, we had to make some compromises.
1. Sync op ids.
For example, id columns are duplicated because we usually name id columns with a prefix for the table: in the `food` table, the id column would be `f_id`. We have a large codebase with joins and multiple queries which could start getting conflicts if we were to use the "id" name. That's why we opted to duplicate the id column in the sync rules, keeping the original column too (roughly as in the sketch below).

Would it be possible to have some kind of mapping in PowerSync where we could specify the name of the PowerSync op id for a particular table?
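A simplified sketch of what that duplication looks like in our sync rules (reduced to the `food` example; illustrative only):

```yaml
# Illustrative sync rules snippet: f_id is selected twice, once aliased to the
# required "id" column and once under its original name, so existing client
# queries that reference f_id keep working.
bucket_definitions:
  global_food:
    data:
      - SELECT f_id AS id, f_id, name, created_at FROM food
```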
2. Column names
Exploring the bucket storage in MongoDB in our local development setup, we've noticed that it stores the full row as a JSON string. That includes the column names, which adds up quickly on large buckets like the ones in our domain. For instance, taking the `data` string field for an op in our example, the full data field takes 377 bytes, and 166 of those bytes are column names (14 columns, not counting the double quotes).
An idea would be to encode the column names to single characters in a transparent way. In our example this would reduce the column names from 166 bytes to only 14 bytes (roughly as sketched below).
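Purely as an illustration of the idea - these are not real PowerSync structures, just the same row with shortened keys and a per-table map to restore them:

```ts
// Illustrative only: the same row encoded with full column names vs.
// single-character aliases, plus a per-table map to restore the originals.
const full = '{"f_id":"a1b2c3","name":"Apple","created_at":"2024-01-15T12:34:56.789Z"}';
const short = '{"a":"a1b2c3","b":"Apple","c":"2024-01-15T12:34:56.789Z"}';
const columnMap: Record<string, string> = { a: 'f_id', b: 'name', c: 'created_at' };

// Decoding restores the original column names transparently on read.
const decoded = Object.fromEntries(
  Object.entries(JSON.parse(short)).map(([k, v]) => [columnMap[k], v]),
);
console.log(full.length, short.length, decoded); // column-name characters drop from 18 to 3 here
```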
3. Dates
We have a few timestamp columns in some of our large tables. Currently each date uses the ISO text format, so it takes about 24 bytes. If we were to use milliseconds since the epoch encoded as a 64-bit integer (int8), it would take only 8 bytes and still cover any practical date (JavaScript's Date type alone goes up to September 13, 275760 AD).
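A quick illustration of the size difference, assuming the millisecond-precision ISO form:

```ts
// Size comparison between the two encodings (illustrative).
const iso = new Date().toISOString(); // e.g. "2026-02-12T08:30:15.123Z"
console.log(iso.length);              // 24 characters, i.e. ~24 bytes as text
const epochMs = Date.parse(iso);      // milliseconds since the Unix epoch
// As a binary 64-bit integer this needs 8 bytes; embedded in a JSON string it
// still serializes as ~13 digits of text, so the win depends on the storage format.
console.log(epochMs);
```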
4. Extra
I'm not familiar with MongoDB, but wouldn't using an object in MongoDB for the data field take less space? Assuming you can read the column/field names in the same way.
Hopefully some of these optimizations are feasible.
Keep up the great work!