How rails sharding connection handling works
First of all, to understand what the rails-sharding gem does, it is necessary to understand how ActiveRecord manages database connections, and how we can use its existing interface to connect to shards.
ActiveRecord v5 defines the following class hierarchy:
ConnectionHandler -[has many]-> ConnectionPool -[has many]-> Connection (Adapter-specific classes)
These are the main components of AR database connection management, so let's look at what each of them does.
The ConnectionHandler is a high-level interface to database connection management. A common Rails application should have only one ConnectionHandler instance, which can be accessed via ActiveRecord::Base.connection_handler. The actual connection_handler getter is defined in ActiveRecord::Core#L134, which is one of the modules included into ActiveRecord::Base.
A ConnectionHandler can manage several ConnectionPools, typically one per database that you wish to connect to. In other words, a vanilla Rails app should have a single ConnectionHandler with a single ConnectionPool.
Besides holding several ConnectionPools, the ConnectionHandler also implements the feature of automatically dealing with forked processes. When a Rails app is forked, the child (new) process inherits all file descriptors from its parent process, including database connections, which ultimately leads to an issue known as file descriptor sharing (you can read more about it here). The solution to this problem is for the child process to avoid using the inherited connections after the fork. The way the ConnectionHandler solves this is by detecting the fork automatically and instantiating new ConnectionPools for the child process, with the same specification as the parent's pools, but initially empty (without any connections). This is completely transparent to the ConnectionHandler user.
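The fork handling described above can be pictured with a toy handler that keys its pools by process id. This is a simplified sketch, not ActiveRecord's implementation: ToyPool, ToyForkAwareHandler, add_pool, pool_for and the injectable pid_source are all hypothetical names, and the "fork" is simulated by changing the pid the handler sees.

```ruby
# Toy model of a fork-aware handler. NOT ActiveRecord's actual code.
class ToyPool
  attr_reader :spec, :connections

  def initialize(spec)
    @spec = spec
    @connections = [] # starts empty; connections would be opened on demand
  end
end

class ToyForkAwareHandler
  # pid_source is injectable only so the behaviour can be shown without forking.
  def initialize(pid_source: -> { Process.pid })
    @pid_source = pid_source
    @specs = {} # spec_name => spec, shared by every process
    @pools_by_pid = Hash.new { |hash, pid| hash[pid] = {} }
  end

  def add_pool(spec_name, spec)
    @specs[spec_name] = spec
    @pools_by_pid[@pid_source.call][spec_name] = ToyPool.new(spec)
  end

  def pool_for(spec_name)
    pools = @pools_by_pid[@pid_source.call]
    # An unseen pid means we are in a forked child: build a fresh, empty pool
    # from the same spec instead of reusing the parent's pool.
    pools[spec_name] ||= ToyPool.new(@specs.fetch(spec_name))
  end
end

current_pid = 100
handler = ToyForkAwareHandler.new(pid_source: -> { current_pid })
handler.add_pool("primary", { adapter: "mysql2", database: "app" })

parent_pool = handler.pool_for("primary")
parent_pool.connections << :inherited_socket

current_pid = 101 # pretend we are now running in the forked child
child_pool = handler.pool_for("primary")
```

In this sketch the child gets a different pool object with the same spec and no connections, so it never touches the socket it inherited from the parent.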
To create a new ConnectionPool inside the ConnectionHandler you need to call ConnectionHandler#establish_connection. This is a poor name for the method IMHO, as it doesn't actually establish a connection to the database; it simply creates an empty ConnectionPool associated with a spec_name. In Rails 5.0 spec_name was a fixed string "primary", but in Rails 5.1 it changed to be the spec key in database.yml, e.g. "development", "production", "staging".
Once the pool is created, you can retrieve it by spec_name using ConnectionHandler#retrieve_connection_pool, or retrieve the DB connection directly using ConnectionHandler#retrieve_connection:
ActiveRecord::Base.connection_handler.retrieve_connection_pool("primary")
ActiveRecord::Base.connection_handler.retrieve_connection("primary")
Of course, Rails users don't usually know about the ConnectionHandler, and they also don't know things such as the spec_name of the database they're using. Therefore, Rails sugar-coats this process by extending ActiveRecord::Base with the ConnectionHandling module, which makes accessing DB connections much simpler:
ActiveRecord::Base.connection_pool
ActiveRecord::Base.connection
ActiveRecord::Base.connection_pool.with_connection { |connection| ... }
Also, when you have different models stored in different databases, the ConnectionHandling module already returns the connection pool/connection for the database where each model is stored:
ModelA.connection_pool #=> connection pool of the DB where ModelA is
ModelB.connection_pool #=> connection pool of the DB where ModelB is
The case above is an example of a single ConnectionHandler managing multiple ConnectionPools.
The ConnectionPool is responsible for managing and allocating DB connections to threads. When you define your connection in config/database.yml you specify the pool size for each database using the pool option (if not, the default is 5). This is the maximum number of database connections a ConnectionPool to that database supports.
The ConnectionPool is lazy. It starts with 0 available connections, and as threads request connections, they are created and checked out of the pool. When the pool runs out of connections, it makes the requesting thread wait for a given timeout. If the wait times out, it raises a ConnectionTimeoutError exception. You can control the timeout duration with the checkout_timeout option in your database specification in config/database.yml.
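To make the lazy behaviour and the checkout timeout concrete, here is a minimal sketch of a bounded pool built on Ruby's Mutex and ConditionVariable. It is a toy under stated assumptions, not ActiveRecord's ConnectionPool: ToyLazyPool, ToyTimeoutError and the created counter are hypothetical names, and a plain Object stands in for a real database connection.

```ruby
# Toy lazy pool with a size limit and a checkout timeout.
# NOT ActiveRecord's ConnectionPool; all names here are made up.
class ToyTimeoutError < StandardError; end

class ToyLazyPool
  attr_reader :created

  def initialize(size: 5, checkout_timeout: 5)
    @size = size
    @checkout_timeout = checkout_timeout
    @available = []
    @created = 0 # nothing is opened until somebody asks for a connection
    @lock = Mutex.new
    @returned = ConditionVariable.new
  end

  def checkout
    @lock.synchronize do
      deadline = now + @checkout_timeout
      loop do
        return @available.pop unless @available.empty?
        if @created < @size
          @created += 1
          return Object.new # stand-in for opening a real DB connection
        end
        remaining = deadline - now
        raise ToyTimeoutError, "no connection within #{@checkout_timeout}s" if remaining <= 0
        @returned.wait(@lock, remaining) # sleep until a checkin or the timeout
      end
    end
  end

  def checkin(connection)
    @lock.synchronize do
      @available.push(connection)
      @returned.signal
    end
  end

  private

  def now
    Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end
end

pool = ToyLazyPool.new(size: 2, checkout_timeout: 0.05)
created_before = pool.created

first = pool.checkout
second = pool.checkout

timed_out = false
begin
  pool.checkout # pool is exhausted, so this waits and then gives up
rescue ToyTimeoutError
  timed_out = true
end

pool.checkin(first)
reused = pool.checkout # no new connection is created, the returned one is reused
```

The third checkout fails only because nobody checks a connection in within the timeout; once first is returned, the next checkout gets that same object back.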
If you're into this kind of stuff you can play with the ConnectionPool directly using the checkout and checkin methods:
connection_pool = ActiveRecord::Base.connection_pool
connection_1 = connection_pool.checkout
connection_2 = connection_pool.checkout
connection_pool.checkin connection_1
connection_pool.checkin connection_2
But be warned! If you forget to check-in the connection it will stay reserved and unavailable for future checkouts, and you run the risk of running out of connections in your pool.
This is one of the reasons you should probably not use the checkout and checkin methods directly, but rather the higher-level methods connection and release_connection (these are the methods Rails uses to retrieve a connection). The main difference is that connection and release_connection associate a checked-out connection with the current thread, and avoid reserving more than one connection per thread. This means that if you forget to call release_connection after you have finished using the connection, at least the next time you request a connection you will get the same one as before, instead of checking out a new one:
connection_pool = ActiveRecord::Base.connection_pool
connection_1 = connection_pool.connection
connection_2 = connection_pool.connection
connection_1 == connection_2 #=> true
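One way to picture why the two calls return the same object is a pool that remembers which connection each thread currently holds. The sketch below is a toy under that assumption, not ActiveRecord's code: ToyThreadBoundPool is a hypothetical class, and a plain Object stands in for a database connection.

```ruby
# Toy pool that binds at most one connection to each thread.
# NOT ActiveRecord's ConnectionPool; names are made up for illustration.
class ToyThreadBoundPool
  def initialize
    @reserved = {} # owner thread => connection it currently holds
    @available = []
    @lock = Mutex.new
  end

  def checkout
    @lock.synchronize { @available.pop || Object.new }
  end

  def checkin(connection)
    @lock.synchronize { @available.push(connection) }
  end

  # Returns the connection already held by this thread, or reserves one.
  def connection
    existing = @lock.synchronize { @reserved[Thread.current] }
    return existing if existing

    fresh = checkout
    @lock.synchronize { @reserved[Thread.current] = fresh }
  end

  # Forgets this thread's reservation and returns the connection to the pool.
  def release_connection
    held = @lock.synchronize { @reserved.delete(Thread.current) }
    checkin(held) if held
  end
end

pool = ToyThreadBoundPool.new
mine_1 = pool.connection
mine_2 = pool.connection                      # same thread, same connection
theirs = Thread.new { pool.connection }.value # another thread gets its own
```

Because the reservation is keyed by thread, repeated calls from one thread never consume more than one connection, while a second thread still receives a separate one.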
In fact, that is how ActiveRecord::Base uses the primary ConnectionPool. When you load a record, for example with User.first, ActiveRecord requests a connection from the ConnectionPool, and it does not call release_connection afterwards. This is actually a sensible thing to do when you have a single database, as it improves performance when you retrieve the connection next time (it stays cached). However, this might be a problem if you're accessing thousands of different databases using different threads.
In terms of using a ConnectionPool directly, it is more memory-efficient to use the with_connection method instead of the connection/release_connection pair. This ensures the connection is checked in to the pool after the block runs, even if the block raises an exception:
connection_pool = ActiveRecord::Base.connection_pool
connection_pool.with_connection do |connection|
# do your work here
end
# connection has been checked-in again when we get here
Just remember that a checked-in connection is not a closed connection. It just means the connection has returned to the pool of available connections and can be checked out by another thread. This leads to better memory utilization but does not free memory directly.
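The guarantee that the connection goes back to the pool even when the block raises can be sketched with Ruby's ensure clause. This is a toy illustration, not ActiveRecord's with_connection: ToyConnection, ToyEnsurePool and the open? flag are hypothetical names used only to show that a checked-in connection is still open.

```ruby
# Toy with_connection built on `ensure`. NOT ActiveRecord's implementation.
class ToyConnection
  def initialize
    @open = true
  end

  def open?
    @open
  end
end

class ToyEnsurePool
  attr_reader :available

  def initialize
    @available = [ToyConnection.new]
  end

  def checkout
    @available.pop or raise "pool exhausted"
  end

  def checkin(connection)
    @available.push(connection)
  end

  def with_connection
    connection = checkout
    yield connection
  ensure
    # Runs whether the block returned normally or raised.
    checkin(connection) if connection
  end
end

pool = ToyEnsurePool.new
begin
  pool.with_connection { |_connection| raise ArgumentError, "boom" }
rescue ArgumentError
  # The error still reaches the caller, but the connection was checked in first.
end
```

After the failed block the pool holds its single connection again, and that connection was never closed, only returned.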
Another interesting feature of the ConnectionPool is the Reaper. It runs periodically (with frequency configurable through the reaping_frequency config) and it identifies connections checked-out by threads that are no longer alive, and checks them in again. The Reaper does not run by default, only if you specify the reaping_frequency option.
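The reaping idea can be sketched as a pool that records the owner thread of every checked-out connection and, when asked, takes back the ones whose owner is dead. This is a simplified toy, not the ActiveRecord Reaper: ToyReapablePool and its reap method are hypothetical, and the periodic scheduling is left out.

```ruby
# Toy pool that can reclaim connections leaked by dead threads.
# NOT the ActiveRecord Reaper; names are made up for illustration.
class ToyReapablePool
  attr_reader :available

  def initialize
    @available = []
    @owners = {} # connection => thread that checked it out
    @lock = Mutex.new
  end

  def checkout
    @lock.synchronize do
      connection = @available.pop || Object.new
      @owners[connection] = Thread.current
      connection
    end
  end

  def checkin(connection)
    @lock.synchronize do
      @owners.delete(connection)
      @available.push(connection)
    end
  end

  # Checks in every connection whose owner thread is no longer alive
  # and returns how many were reclaimed.
  def reap
    @lock.synchronize do
      dead = @owners.reject { |_connection, owner| owner.alive? }.keys
      dead.each do |connection|
        @owners.delete(connection)
        @available.push(connection)
      end
      dead.size
    end
  end
end

pool = ToyReapablePool.new
leaked = Thread.new { pool.checkout }.value # thread exits without checking in
kept = pool.checkout                        # owned by the main thread, still alive
reaped = pool.reap
```

A real reaper would presumably call something like reap from a background thread every reaping_frequency seconds; here it is invoked once by hand to keep the example deterministic.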
As of now, the ConnectionPool doesn't have any methods to actually disconnect and clear stale connections sitting in the pool for too long (this was included in #31221, will probably be released with Rails 5.2, and will work when the reaper is activated). This might be relevant if you are accessing a large number of databases, as every connection ever opened to any database ever queried will be kept open forever. This results in a larger memory footprint for your Rails process and might result in your DB running out of open connection slots (usually on the order of hundreds to thousands for a database). Of course, you will only have this kind of problem if you're accessing a very large number of databases or if you have very large connection pools.
You can manually and individually remove connections from the ConnectionPool using ConnectionPool#remove, but it neither closes the connection nor checks whether the connection is in use; it simply removes the connection from the pool's management.
There is a connection option called wait_timeout, which can be set in your database.yml, that represents the number of seconds the database waits before closing an idle connection automatically. Rails currently sets this configuration to a very large number by default (see code) so it doesn't have to worry about timed-out connections in the connection pools. If you're getting close to the maximum number of connections your database supports, it is worth looking into this option.
The Connection class is defined for each of the DB adapters (MySQL, Postgres, etc.) and it holds a raw_connection, which is a DB client defined by some external gem (mysql2 for MySQL, pg for Postgres). You can run queries using Connection#execute and you can do some simple connection management with Connection#disconnect!, Connection#active? and Connection#reconnect!.
If you want to actually close/disconnect a connection, it is important to check it in to the ConnectionPool first, then remove it from the ConnectionPool, and only then call disconnect! on it. If you do not remove it from the pool before disconnecting, it will eventually be served to some thread, and it will crash that thread when used.
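The order of operations described above can be captured in a small helper. close_connection_safely is a hypothetical name, not part of ActiveRecord or rails-sharding; it only relies on the checkin, remove and disconnect! methods mentioned in this text, and the recording classes below are stand-ins so the call order can be shown without a database.

```ruby
# Hypothetical helper; not part of ActiveRecord or rails-sharding.
def close_connection_safely(pool, connection)
  pool.checkin(connection) # 1. give it back so no thread still owns it
  pool.remove(connection)  # 2. take it out of the pool's management
  connection.disconnect!   # 3. only now close the underlying DB client
end

# Stand-ins that just record which methods were called, and in what order.
class RecordingPool
  attr_reader :calls

  def initialize
    @calls = []
  end

  def checkin(_connection)
    @calls << :checkin
  end

  def remove(_connection)
    @calls << :remove
  end
end

class RecordingConnection
  def initialize(log)
    @log = log
  end

  def disconnect!
    @log << :disconnect!
  end
end

pool = RecordingPool.new
close_connection_safely(pool, RecordingConnection.new(pool.calls))
```

With a real pool you would pass an actual ConnectionPool and one of its connections; the sketch only demonstrates that the connection is removed from the pool before it is disconnected.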
Now that you understand what ActiveRecord offers us, you're ready to understand how the rails-sharding gem manages connections to different shards.
Based on a configuration file (config/shards.yml) the gem defines a new (and static) ConnectionHandler just to deal with shard connections. In other words, a completely separate handler from the one Rails uses.
ActiveRecord::Base.connection_handler # => ConnectionHandler for Rails
Rails::Sharding::ConnectionHandler # => static ConnectionHandler for shards
At Rails initialization, the gem creates a ConnectionPool for each shard (in all shard groups) by calling Rails::Sharding::ConnectionHandler#establish_connection for each shard, the same way ActiveRecord does for the primary database.
You can then retrieve connections or connection pools for shards using:
Rails::Sharding::ConnectionHandler.connection_pool(shard_group, shard_name)
Rails::Sharding::ConnectionHandler.retrieve_connection(shard_group, shard_name)
Rails::Sharding::ConnectionHandler.with_connection(shard_group, shard_name) { |connection| ... }
And the connection pools will work as usual, opening connections when necessary and allocating them to threads that request them.
Finally, the missing piece to make it all work is the module Rails::Sharding::ShardableModel, which works as a ConnectionPool switcher, making the model access the primary ConnectionPool or one of the shards' ConnectionPools, depending on the scope.
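The idea of a ConnectionPool switcher can be illustrated with a toy module that picks a pool based on a shard selected for the current thread. This is only a sketch of the concept and not the gem's code: ToySharding, ToyShardableModel, using_shard, current_shard and the symbols standing in for pools are all hypothetical.

```ruby
# Toy pool switcher. NOT the rails-sharding implementation; names are made up.
module ToySharding
  KEY = :toy_current_shard

  # Selects a shard for the duration of the block, then restores the
  # previous selection even if the block raises.
  def self.using_shard(group, name)
    previous = Thread.current[KEY]
    Thread.current[KEY] = [group, name]
    yield
  ensure
    Thread.current[KEY] = previous
  end

  def self.current_shard
    Thread.current[KEY]
  end
end

module ToyShardableModel
  # Returns a shard pool inside a using_shard block, the primary pool otherwise.
  def connection_pool
    shard = ToySharding.current_shard
    shard ? shard_pools.fetch(shard) : primary_pool
  end
end

class ToyUser
  extend ToyShardableModel

  def self.primary_pool
    :primary_pool # symbol standing in for the primary ConnectionPool
  end

  def self.shard_pools
    { [:users, :shard1] => :users_shard1_pool }
  end
end

outside = ToyUser.connection_pool
inside = ToySharding.using_shard(:users, :shard1) { ToyUser.connection_pool }
afterwards = ToyUser.connection_pool
```

The model's callers never name a pool explicitly; the surrounding scope decides which pool the model resolves to, and leaving the block switches it back to the primary one.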
Is it necessary to know all of this? Usually it is not. You can manage very large Rails applications with or without shards and never have to know any of this. However, if you're connecting to a very large number of shards, or your databases are already near the limit of open connections, you will probably have to deal with that yourself, and learn to manage connection pools and connections manually.