slug | description |
---|---|
/ | Vitess for Postgres. |
Multigres is a project to build an adaptation of Vitess for Postgres.
This document is a loosely ordered tentative list of features we intend to build or import from Vitess.
Multigres will have a two-level proxy layer. In the case of a single small server, the primary benefit of these two layers would be connection pooling.
MultiGateway is the top layer. In its simplest form, it will masquerade as a Postgres server. When it receives a query, it will forward the request to the next (MultiPooler) layer. In the case of a single Postgres server, its primary usefulness is to shield the clients from restarts or failovers that may happen in the underlying layers due to software rollouts and failures.
In the case of primary-replica configurations, MultiGateway can be configured to send read traffic to the replicas.
The MultiGateway layer could be conceptualized as the compute layer of Multigres. It can be horizontally scaled as needed.
There will be one MultiPooler per Postgres instance. MultiPooler's primary function is to provide connection pooling. It will be aware of transaction state and connection-specific state changes, and will preserve correct behavior for the clients connected at the MultiGateway level.
Each MultiPooler, its associated Postgres instance, and its storage are treated as one unit. This trio can be conceptualized as one node in the storage layer. In this layer, data can be replicated and/or sharded across other (trio) nodes as needed. Traditional storage layers typically implement a file or object store API. In the case of Multigres, the storage layer API is that of a database.
As the data grows, you will soon encounter the need to split some tables into shards. When this need arises, Multigres will manage the workflows to split your monolithic Postgres instance into smaller parts. Typically, the smaller tables will remain in the original Postgres database, and the bigger tables will be migrated out into multiple (sharded) databases.
Multigres will provide a powerful relational sharding model that will help you keep tables along with their related rows in the same shard, providing optimum efficiency and performance as your data grows.
TODO: Doc on Vitess sharding model
In a sharded setup, MultiGateway's functionality will expand to present this distributed cluster as if it were a single Postgres server. For simpler queries, it will just act as a routing layer, sending them to where the data lives. For more complex queries, it will act as a database engine while maximally leveraging the capabilities of the individual databases underneath.
MultiGateway will have the ability to push entire join queries into an underlying shard if it determines that all data for that join is within that shard. Similarly, MultiGateway will "scatter" entire joins across all shards if it determines that the related rows of a join are within their respective shards.
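As a hedged illustration of the pushdown case, assume two hypothetical tables, `customer` and `corder`, both sharded by `customer_id` so that related rows live in the same shard (the table names and schema are assumptions, not part of Multigres):

```sql
-- Hypothetical tables, both sharded by customer_id, so a customer's orders
-- live in the same shard as the customer row itself.
-- customer(customer_id, name, ...)
-- corder(order_id, customer_id, total, ...)

-- Because the join condition follows the sharding key, the entire join can be
-- pushed into a single shard (the one owning customer_id = 42). Without the
-- WHERE clause, the same join could be scattered to all shards and merged.
SELECT c.customer_id, c.name, o.order_id, o.total
FROM customer c
JOIN corder o ON o.customer_id = c.customer_id
WHERE c.customer_id = 42;
```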
The query serving functionality of MultiPooler will have no awareness of sharding. It will just use the underlying Postgres instance to serve the requested queries. However, MultiPooler will be the workhorse behind facilitating all the resharding efforts.
When the need to reshard arises, new (target) MultiPoolers will be created to receive filtered data from the original unsharded tables. This data will be streamed by the source MultiPooler. Once these tables are populated and up-to-date, a failover will be performed to move traffic to the sharded MultiPoolers. For safety, the replication will be reversed. If any issue is found after the cutover, it can be undone by switching traffic back to the source tables.
This failover and fail-back can be repeatedly performed without requiring any change in the application.
Due to the flexible sharding scheme of Multigres, you should be able to minimize or completely eliminate the need for distributed transactions through careful selection of an optimal sharding scheme. However, if the need arises, MultiGateway and MultiPooler will work together and use the two-phase commit protocol supported by Postgres to complete transactions that span multiple instances.
TODO: Doc on 2PC atomicity and isolation
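For reference, the Postgres two-phase commit primitives mentioned above look like the following. This is only a hand-driven sketch with a hypothetical `account` table; in Multigres, MultiGateway and MultiPooler would coordinate these steps automatically:

```sql
-- Requires max_prepared_transactions > 0 on the participating servers.
-- Shard names and the 'account' table are illustrative.

-- On shard 1:
BEGIN;
UPDATE account SET balance = balance - 100 WHERE account_id = 1;
PREPARE TRANSACTION 'multigres_txn_42';

-- On shard 2:
BEGIN;
UPDATE account SET balance = balance + 100 WHERE account_id = 2;
PREPARE TRANSACTION 'multigres_txn_42';

-- Once all participants have prepared, the coordinator commits on each of them:
COMMIT PREPARED 'multigres_txn_42';
-- If any participant fails to prepare, each prepared participant instead runs:
-- ROLLBACK PREPARED 'multigres_txn_42';
```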
Multigres allows the Postgres data files to be stored on local NVME. For an OLTP system like Multigres, local NVME has substantial advantages over a mounted drive:
- IOPS are free (no separate provisioning or per-request charges)
- Latency is lower, by about an order of magnitude
These advantages translate into higher performance and reduced cost.
Multigres will implement a consensus protocol to ensure that transactions satisfy the required durability policy. This will be achieved using a two-phase sync to replicate the WAL to the required number of replicas in the quorum. This functionality eliminates the need to rely on a mounted (and replicated) file system like EBS to safeguard from data loss.
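As a point of comparison only (Multigres's two-phase sync protocol is its own mechanism, not this setting), Postgres's built-in quorum commit expresses a similar durability policy:

```sql
-- Built-in Postgres quorum commit: a commit is acknowledged only after the WAL
-- has been flushed on any 2 of the 3 listed standbys. Shown purely as an
-- analogy for the durability policy described above; standby names are made up.
ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (replica1, replica2, replica3)';
ALTER SYSTEM SET synchronous_commit = 'on';
SELECT pg_reload_conf();
```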
In case of a primary node failure, MultiOrch (Multigres Orchestrator) will promote an existing replica to primary, ensuring that it contains all the committed transactions of the previous primary. Following this, Multigres will resume serving traffic using the new primary. Essentially, this solves the problems of durability and high availability.
The above approach aligns with the conceptual view of the MultiPooler+Postgres+Data trio as being part of the data layer. Hence, there is no need to rely on yet another (mounted) data layer underneath Postgres.
It is certainly possible to run Multigres on a mounted drive. It is just not necessary.
Multigres will come equipped with a variety of tools and automation to facilitate the management of your cluster. This tooling is capable of scaling to tens of thousands of nodes, and can be spread across multiple regions worldwide.
Multigres can automatically perform backups of your Postgres instances on a regular basis to a centralized location.
Adding more replicas to a Postgres instance will be as trivial as launching a MultiPooler with the right command line arguments. The MultiPooler will seek the latest available backup, restore it, and point the Postgres instance to the current primary. Once the replication has caught up, it will open itself up to serve read traffic.
MultiPoolers can be of different types: they can be quorum participants, in which case they will follow the two-phase sync protocol during replication. Otherwise, they will be "observers" that only serve additional read traffic.
Bringing up a MultiPooler as a quorum participant will automatically make it join the quorum after it has fully caught up on replication.
MultiPoolers will have configurable user-defined tags that will allow MultiGateways to split traffic based on differing workloads. For example, you may tag a group of MultiPoolers as "olap". Workloads that would like to perform analytics queries would specify "dbname@olap" as the database to connect to. This will inform the MultiGateways to redirect these reads to only MultiPoolers that export the "olap" tag.
Beyond managing the consensus protocol for MultiPoolers in the quorum, MultiOrch will also monitor all MultiPoolers and ensure that they are replicating from their respective primaries. If a connection is lost for any reason, MultiOrch will restore it.
Multigres can be deployed across a large number of nodes distributed across the world. This is architecturally achieved by a star configuration in the topology. There will be a global topology server (etcd). It will contain information that changes infrequently, like sharding info, etc. This information will be propagated into cell-specific topology servers (local etcds), one for each cell. MultiGateways and MultiPoolers will be deployed in a cell, and they'll use the local topology to serve traffic for that cell. In case of a network partition, a cell will be capable of continuing to serve traffic for as long as the replication lag is tolerable.
Of course, there can be only one primary per shard. Any cell that does not host the primary database is meant to serve read-only traffic that is replicated from the primary.
Multigres will be cloud-native. This means that it can run in a cloud framework like Kubernetes:
- All components can be restarted without disrupting the overall system, within a configurable disruption budget.
- Components can be migrated to different nodes.
- Components can use mounted persistent storage, and will automatically reattach if moved.
- New components can be launched to increase capacity. They will initialize themselves correctly, and join the system to serve traffic. Conversely, components can be removed in order to scale down.
- Specific to databases: Primary failures and restarts will be handled automatically by a watchdog process (MultiOrch) that can perform a failover to an up-to-date replica.
- Scale to zero: You can shut down all components. This will result in just the metadata and backups being preserved, with no active components to serve any traffic. Adding the serving components back to the system will bootstrap the cluster to an operational state.
Multigres will come with a Kubernetes Operator that will translate components described using Multigres terminology like cells, shards and replicas into corresponding Kubernetes components.
We intend to develop a way for Multigres to use local storage in order to leverage the proximity of Postgres and its data files.
TODO: Doc to elaborate on local NVME
The next few features described below will all be powered by a primitive called Materializer. This primitive is essentially a stream processor capable of materializing a source set of tables into a target set of tables. The rules of materialization can be expressed as an SQL statement. The only limitation is that the expression must be stream processable. In other words, we should be able to incrementally and independently apply every change event to the target. For example, a `count(*)` is stream processable, whereas a `max(col)` is not.
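As a rough sketch of the distinction, using a hypothetical `corder` table (the eventual Multigres syntax for defining a materialization is not specified here; these are plain SQL expressions of the rules):

```sql
-- Stream processable: every insert or delete on corder maps to an independent
-- +1/-1 delta on the matching target row.
SELECT customer_id, count(*) AS order_count
FROM corder
GROUP BY customer_id;

-- Not stream processable: deleting the row holding the current maximum cannot
-- be applied incrementally; the group would have to be rescanned.
SELECT customer_id, max(price) AS max_price
FROM corder
GROUP BY customer_id;
```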
This materialization can happen without causing any downtime to the source database. Once the target tables are caught up to the sources, Materializer will verify correctness of the target tables by performing a diff.
If the materialization expression is non-lossy (reversible), you can atomically switch traffic from the source tables to the target. At this point, Materializer can reverse the replication to keep the source tables up-to-date. You can go back and forth indefinitely. This ability allows you to undo a migration if problems are encountered after a cutover.
Materializer streams operate orthogonally to each other. For example, resharding itself uses Materializer, yet a table materialization running at the same time will correctly migrate to the new shards once resharding completes.
Materializer works in conjunction with MultiGateway routing rules that allow you to smoothly and safely transition traffic from source to target.
MoveTables will be a workflow built using Materializer. This will allow you to migrate a group of tables from any source to any target. This can be used to split a database into smaller parts, or merge two databases into one. Let us assume that you want to migrate table `t` from database `a` to database `b`.

- Initially, the MultiGateways will have a rule to redirect traffic intended for `a.t` and `b.t` to `a.t`. MoveTables will set up these routing rules automatically before starting Materializer.
- Once this rule is set up, you can start refactoring your application to write to `b.t` instead of `a.t`. Writes to either of these tables will get redirected to `a.t`.
- Once the data is verified to be correct, we can switch the MultiGateway routing rules to send traffic to `b.t` instead. MoveTables will have a subcommand to do this safely. If the application has not finished refactoring all the code, that is okay, because all traffic will flow to `b.t`.
- At this point, Materializer will reverse the replication to keep `a.t` up-to-date. If any problem is detected after the cutover, you can fall back to `a.t`. This back and forth can be repeated as often as necessary.
- Once we are certain that all problems are resolved, and we have verified that the application refactor is complete, we can use the MoveTables cleanup command to drop the reverse replication, the routing rules, and the source table `a.t`.

Of course, this workflow will require the MultiGateways to have access to both `a` and `b`.
Materializer will rely on Postgres logical replication. Due to this design, we can also use it to safely migrate from one version of Postgres to another without incurring downtime, and with the ability to revert a migration, just like the case of a table migration.
This will essentially be a MoveTables operation, but with the source and target running different versions of Postgres.
Materializer will be used for resharding. For this purpose, Multigres will use the filtering ability of Materializer to split sharded (or unsharded) tables into their target shards based on the target's sharding key ranges.
Just like the other cases, the ability to verify correctness and revert will be available.
It may happen that the original sharding key you chose for the table was suboptimal, or the application workload may have changed substantially. In such cases, you can use Materializer to change the sharding key of a table to one better suited to the current workload.
If you are currently running on a MySQL database, or on Vitess, Multigres will allow you to migrate from that source. In this situation, you may have to do additional work to switch traffic yourself, because Multigres will only support the Postgres protocol and will not be able to orchestrate the traffic meant for the source MySQL.
Materializer is versatile enough that it can be used to address migration use cases that have not been thought of yet. For example, you could script it to merge tables from different databases into one. This can be a necessity if you decide to merge many multi-tenant databases into a single sharded one, or vice-versa.
Multigres materialized views will be built using Materializer. Such views are physical tables that are updated in real time by processing events from the WAL. Consequently, the complexity of the SQL expression for these views is limited to operations that can be incrementally processed from the change events. For example, `count(*)` will be supported, while `max(col)` will not be. Another trade-off of materialization is that it is eventually consistent.
In a sharded environment, materialized views can greatly enhance the performance of certain join operations, if the eventual consistency trade-off is acceptable:
- For a table that has foreign keys into two different tables, you can have the source table sharded by the first foreign key and materialize a copy of it sharded by the second foreign key. This approach will allow you to efficiently join this table with either of the two other tables, because both joins become in-shard joins.
- Reference tables: Many databases have smaller tables with slow-changing data that nevertheless need to be joined with massive tables that must be sharded. In this situation, you can materialize the small table into all the shards of the bigger table, turning such a join into an in-shard join (see the sketch below).
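Here is a hedged sketch of the reference-table case, with hypothetical `invoice` (large, sharded) and `currency` (small, slow-changing) tables; the mechanism for declaring the materialization in Multigres is not specified here:

```sql
-- The materialization rule is simply a full copy of the small table, applied
-- to every shard of the big table:
--   SELECT currency_code, symbol, name FROM currency;

-- Once currency exists in every shard, this join no longer crosses shards:
SELECT i.invoice_id, i.amount, c.symbol
FROM invoice i
JOIN currency c ON c.currency_code = i.currency_code;
```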
Some schema changes are inherently time-consuming and have the additional downside of locking the table against other changes. In such situations, Multigres will use Materializer to materialize the post-schema-change version of the table within the same database. Once the materialization is complete, it will swap the old table with the new one, thereby achieving a non-blocking, zero-downtime deployment.
Additionally, the materialization will be reversed. If the schema deployment causes problems, you have the option to instantly revert with no data loss.
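For intuition, the swap step looks roughly like the classic shadow-table pattern sketched below. Table and column names are hypothetical, and Multigres would drive the backfill and catch-up through Materializer rather than a one-shot copy:

```sql
-- Create a shadow table with the desired post-change schema.
CREATE TABLE orders_new (LIKE orders INCLUDING ALL);
ALTER TABLE orders_new ALTER COLUMN note TYPE text;

-- (Materializer backfills orders_new and keeps it in sync with orders here.)

-- Swap the tables once the shadow copy has caught up.
BEGIN;
ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_new RENAME TO orders;
COMMIT;
```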
Materializer will be made up of two parts: the part that streams from the source (MStream), and the part that writes to the target (MPlayer). MStream will also speak the Postgres logical replication protocol. This will allow you to integrate Multigres with any tooling that can consume such a stream.
If you use MStream, you do not need to perform a dump and restore. Instead, you can initiate a stream "from the beginning of time". With this directive, MStream will provide all the events necessary to completely materialize the source into the target.
As your database grows into multiple shards, you can run individual MStreams for each shard for better parallelization.
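MStream's own interface is not defined here, but since it speaks the standard Postgres logical replication protocol, a consumer of such a stream can be pictured with the built-in Postgres functions (requires `wal_level = logical`; the slot name and the `test_decoding` plugin are used only for illustration):

```sql
-- Create a logical replication slot using the built-in test_decoding plugin.
SELECT pg_create_logical_replication_slot('mstream_demo', 'test_decoding');

-- Peek at pending change events without consuming them from the slot.
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('mstream_demo', NULL, NULL);
```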
Apart from exporting real-time metrics, Multigres will also have an extensive toolset to facilitate troubleshooting when the system exhibits unexpected behavior. These will be exported by MultiGateway and MultiPooler:
- Standard metrics like QPS, latencies, and error rates.
- Per-table metrics.
- Normalized per-query metrics. Each query is normalized by stripping out literal values, and the normalized form becomes the key for its own metric (see the sketch after this list). This helps identify the worst-performing queries. Because it can lead to an explosion of values, this metric is optional and served from a separate, human-readable endpoint.
- On-demand, real-time query and transaction logs. These provide a snapshot of everything currently in progress, which is useful during an active incident.
- Error logs.
- Trace points.
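To illustrate what "normalized" means above, here is a hypothetical query and the kind of key it would be reduced to (the exact placeholder syntax is an assumption, not the final Multigres format):

```sql
-- Raw query as issued by the application:
SELECT * FROM users WHERE email = 'alice@example.com' AND status = 3;

-- Normalized form that becomes the metric key:
SELECT * FROM users WHERE email = $1 AND status = $2;
```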
Multigres will be able to restore any shard, or an entire cluster, into a specially named read-only database allowing you to view a snapshot of the data as of a certain time. This is achieved by restoring an old enough backup and applying logs up to the necessary timestamp.
If multiple shards are restored, there is no guarantee of transaction consistency across shards.
The Messaging feature will implement transactional message queues that can guarantee that every message is processed. The message queue will keep resending unacknowledged messages while exponentially backing off, until an acknowledgement is received.
Rows can be transactionally inserted into a message table as part of a larger transaction that involves other tables.
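A hedged sketch of such a transaction, with a hypothetical `order_events_queue` message table (the schema Multigres will require for message tables is not specified here):

```sql
BEGIN;
UPDATE orders SET status = 'shipped' WHERE order_id = 42;
-- Enqueue the message in the same transaction as the business change.
INSERT INTO order_events_queue (order_id, payload)
VALUES (42, '{"event": "shipped"}');
COMMIT;  -- the message exists if and only if the order update committed
```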
Multigres will implement the ability to enable various database protection features:
- Query Killer: Terminate a query if it takes too long.
- Transaction Killer: Rollback a transaction if it takes too long.
- Result limiter: Return an error if the number of rows returned by a query exceeds a threshold.
- Result consolidator: If multiple identical read queries are simultaneously sent to the database, only one query is executed, and the results are shared across all other requests.
If any of these features are already available in Postgres, Multigres will utilize them under the covers.
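For instance, the first two features map onto existing Postgres settings, which Multigres could lean on where they fit (the values below are arbitrary examples):

```sql
SET statement_timeout = '30s';                    -- query killer: abort long-running statements
SET idle_in_transaction_session_timeout = '60s';  -- closes sessions left idle inside a transaction
```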
MultiAdmin will be a dashboard that lets you view the various components of a live cluster, such as the currently running MultiPoolers, and browse to their status pages.