Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

transactions and different db servers #264

Closed
apneadiving opened this issue Oct 5, 2015 · 56 comments · Fixed by #271
Closed

transactions and different db servers #264

apneadiving opened this issue Oct 5, 2015 · 56 comments · Fixed by #271

Comments

@apneadiving
Copy link
Contributor

Hi and thanks for the gem :)

Here are some questions I have concerning its use. Please be aware it would be possible to see them as a freelancing mission.

Store tenants in different databases servers

In the current state of the gem, it seems the assumption is that all tenants are stored in the same db server. Am I right to assume this? Is it planned to support different servers?

Cross tenant transactions

I realized transactions work fine whenever ran across different schemas of the same db.

But I cant make them work whenever tenants are in different db (undefined method 'rollback' for #<ActiveRecord::ConnectionAdapters::ClosedTransaction:0x007ffef036a320>).

Could you please provide some guidance?

Thanks in advance

@bradrobertson
Copy link
Contributor

Different db servers
As it stands, there's no official way of separating tenants physically (called physical sharding) There is a plugin being written for postgres that could allow for this, but there's an open issue to support sharding by schema name. That effectively would allow physical sharding.

Cross tenant transactions
Your question seems to suggest that you already solved the first point? You have tenants across different dbs? Cross tenant transactions on the same db server are for sure fine, but again, since we haven't solve the above issue so I don't know how you'd even attempt a cross database transaction. It should be noted that distributed transaction systems are pretty complex and not really part of this gem (or part of postgres itself). This problem should/would be solved at the db layer rather than gem layer though I'm not aware of anything in postgres/mysql that would allow for this.

@apneadiving
Copy link
Contributor Author

different db servers

ok I understand

cross tenant transactions

Yes for tenants stored as schemas, its quite straight forward: everything can be wrapped in an ActiveRecord::Base.transaction block.
Whenever I use use_schemas = false I tried to do (with Post being tenant based):

Apartment::Tenant.switch('tenant1') do 
  Post.transaction do
    Post.create!
    Apartment::Tenant.switch('tenant2') do 
      Post.transaction do
        Post.create!
      end
    end
  end
end

I tried this way too:

Apartment::Tenant.switch!('tenant1')
Post.transaction do
  Post.create!

  Apartment::Tenant.switch!('tenant2')
  Post.transaction do
    Post.create!
  end
end

But it seems tenant1 connection is closed whenever tenant2 is opened, because even when data is valid, there is no commit for the first Post in tenant1.

@bradrobertson
Copy link
Contributor

Ah, sorry I see what you're doing. Ya you can't do cross-connection transactions. When you're not using use_schemas ActiveRecord establishes a new connection for each tenant switch. You can't have a transaction span different db connections so this won't work.

We're likely going to sunset the connection based strategies soon and strip them out to a separate gem for legacy systems as it's really not the recommended way of doing things.

@apneadiving
Copy link
Contributor Author

Different db servers

Whenever use_schemas is false, on switch, it seems you create a connection to the desired db. How come it couldnt be on a different server? It seemed to me its only a matter of connection options (host, username etc which could be stored in the tenants tables), could you tell me what I'm missing please?

I wish there was a roadmap concerning the issue you opened on pg_shard :)

cross tenant transactions
ok :)

@bradrobertson
Copy link
Contributor

yup, you're right, it's literally making a new connection so it would just be a matter of switching those credentials to a new machine.

So there's nothing stopping us, it's just never been implemented. You're welcome to submit a PR for something if you like. Basically we'd need a map of 'tenant_name' => 'db credentials' so that we can lookup the config for that tenant. Might not be too hard.

@apneadiving
Copy link
Contributor Author

Ok I'll look into it :)

Thanks for the lightning quick answers

@apneadiving
Copy link
Contributor Author

I've just made another test just to check for curiosity's sake.
I have use_schemas set to false and User is in excluded_models.

Then these work (and rollback properly if I raise an error in the blocks):

Post.transaction do 
  Post.create!
  User.transaction do 
    User.create!
  end
end

User.transaction do 
  User.create!
  Post.transaction do 
    Post.create!
  end
end

I'm not sure I get the difference with the case where we have two connections on different tenants.
Is this code just working by accident?

@bradrobertson
Copy link
Contributor

That shouldn't work, but this is all just ActiveRecord w/ multiple connections, nothing to do with Apartment. I wouldn't rely on it.

@apneadiving
Copy link
Contributor Author

I suggest to add the following config:

config.tenant_with_config = true
config.tenant_with_config_options = {
  class: Tenant,
  name_column: 'name'
}
# then I guess config.tenant_names is not useful ? 

If tenant_with_config is true:
1- use_schemas must be false
2- I would allow calls like Apartment::Tenant.switch!('tenant1'), it would find tenant by name (based on tenant_with_config_options) then delegate to method below
3- I would create Apartment::Tenant.switch_with_options!(tenant1), where tenant1 must respond to a method to provide actual db config, say db_config
4- elevators use would not be changed, since I guess it goes through (2)
5- I'd say db_config must return a hash. I'd only take its non nil values and merge them with the default config. This would let users have tenants on the main connection by default, useful for retro compatibility
6- I would not allow Apartment::Tenant.create('tenant_name'), and would require Apartment::Tenant.create(tenant). Because we wouldnt interfere with the structure of the model used to represent the tenants
7 - I'll have to work on the migrator, not sure whats involved there

@bradrobertson
Copy link
Contributor

Wow that's a lot of changes. What if tenant_names could just return a Hash, OR an Array and if it's a hash, they key represents the tenant name and the value is the connection options that would be passed down to the adapter? Then nothing about the public interface needs to change.

@apneadiving
Copy link
Contributor Author

Actually only the Apartment::Tenant.create public api would have to change changed.

Doing your way, we'd simply not need additional config, and we'd have:

1- I guess we still should check use_schemas is false if tenant_names is a Hash
2, 3 - its just internal actually
6 - api change still needed, maybe Apartment::Tenant.create(tenant_name: db_config_hash)

@bradrobertson
Copy link
Contributor

If your tenant_names method returned the proper config for my-new-tenant, then you wouldn't need this to happen. Basically, you'd add the record into your Tenant model that has name, connection or whatever, then when you call create it'd just do the lookup.

Unless i'm missing something?

@apneadiving
Copy link
Contributor Author

Some clients of mine want their data to be fully isolated from the rest (HIPAA constraint).
So basically I could create a db then create the new tenant.
If I can only rely on the tenant_names list, then I guess I'd have to redeploy to get a new db visible in the stack.

@emilebosch
Copy link

Hi @apneadiving I have the same constraint! From our pentesters. Wanna go and hack on this together? It would be awesome if switching from fully isolated to schema would be frictionless!

@apneadiving
Copy link
Contributor Author

@emilebosch once we agree about the api with @bradrobertson we could start. Pretty cool you're in the same timezone. We could definitely work together on this, I intend to have it done this week.

@emilebosch
Copy link

Awesome Ok i have to go run back in a meeting but we'll be shortly back.

@bradrobertson
Copy link
Contributor

@apneadiving I'm not sure why you'd need to redeploy. Here's how I'd do it:

  1. Create the new physical machine (@ postgres://some-pg-url/dbname for example)
  2. Add an entry into your Tenant model name: 'whatever', connection: 'postgres://some-pg-url/dbname'
  3. tenant_names will be dynamically generated if you give it a block something like
Apartment.configure do |config|
  config.tenant_names = ->{ Hash[Tenant.pluck(:name, :connection)] }
end

And it would immediately be available

@emilebosch
Copy link

This is super awesome!

A couple of things functional requirements from my side: I don't know if it applies to you guys but:

  • switching from schema to db should be frictionless.
  • it would be cool if a tenant create would actually create and schema load the db. As for the credentials I think we should either provide them from the model or use some convention over configuration magic making a setup very easy to do.
  • the user creating the database also should make sure that the tenant has no rights to switch databases preventing reading all the others tenants' data. It should basically only have rights to create a db. Passwords should not be readable by an sql injection . We can do this by making the password from the tenant name + secret hash for instance.
  • migrations across all tenants should be atomic

Are these good points?

@emilebosch
Copy link

Also the actual requirements are: prevent leaking data from other tenants in case of a SQL injection breach

@mikecmpbll
Copy link
Collaborator

One of the biggest problems you'll face is making it behave with threads for folks using a threaded app server or things like sidekiq.

@bradrobertson
Copy link
Contributor

ActiveRecord connection pool is already thread safe.

@mikecmpbll
Copy link
Collaborator

How will you be switching connections between tenants? Unless you know something I don't (which is entirely likely 😄) you're not able to use the same AR class with two different database servers concurrently, due to the way that the connection pool is inextricably linked to the AR class and multiple DB servers need different connection pools.

This is exactly the problem in #180 afaict.

@apneadiving
Copy link
Contributor Author

@bradrobertson ah ok, yes you're right Apartment::Tenant.create Doesnt have to create the ovject in db. All clear then, I can start to work.

@emilebosch you have a lot of requirements, most of them join mine, but I'm not sure some are possible (like frictionless schema or db). I'll start with my scope to get things started.

@bradrobertson
Copy link
Contributor

@mikecmpbll You may be entirely correct. I don't have a firm knowledge of the AR connection handling other than I've read that it's thread safe. I was under the impression that there was actually a pool of pools, 1 for each establish_connection call that would maintain open connections to all physical machines. Then, each thread would check out a connection to make its requests on, thus different threads could connection to different physical machines. I haven't actually played with that though and you're right #180 is basically saying this isn't the case :(

@apneadiving
Copy link
Contributor Author

I'm not sure I understand the limitation: apartment already switches connections between db.
Changing host or user is that different from changing db name only?

@bradrobertson
Copy link
Contributor

What Mike is suggesting is that ActiveRecord connection pooling is not actually thread safe, ie that if 2 different threads call switch to 2 different physical hosts at the same time, it's possible that 1 thread could actually use the connection of the other. I haven't verified this in any way, but it's definitely something for ActiveRecord to solve, not Apartment.

@mikecmpbll
Copy link
Collaborator

ConnectionPool is threadsafe; threads will pick up their own connections from the connection pool and everybody's happy. However, ConnectionPool holds it's own ConnectionSpecification, with the connection info. If we want to connect to some other database, we have to establish a new connection pool.

The ConnectionPool that AR uses is resolved hierarchically from the class you are querying, so if you query Post.all and do not have any connection information in Post (establish_connection ...), it looks for the connection pool that's assigned to AR::Base.

The short implementation of establish_connection makes it pretty obvious to see how it links the class (owner) with the newly established pool. What this means in practice is that establish_connection is not threadsafe because all our threads are looking to our AR class for the connection pool.

Rails does all their threadsafe connection handling within the pool, so they need the pool to be shared across threads, which is fine most of the time. The real problem as far as I can see it is the rigidity of the mapping between class and connection specification.

I hope that makes sense, and if anyone's an expert on this and wants to tell me I've got it all wrong I'd be the first to welcome that too! Like Brad says though, this is a Rails issue and needs to be sorted there. I've been meaning to hack on it for a few months, maybe this weekend will be the start.

@apneadiving
Copy link
Contributor Author

Ok I guess I kind of understand thanks to the explanations :)

So the problem can arise whenever we use multithreaded processes like Sidekiq or Puma/Unicorn. But it would be fine with single threaded one like Thin or DelayedJob?

It fails no matter what the value of use_schemas right?

@emilebosch
Copy link

ok @mikecmpbll guess you're now captain awkward thread bugs. Is a connection pool per tenant what you're suggesting?

@mikecmpbll
Copy link
Collaborator

@apneadiving :

So the problem can arise whenever we use multithreaded processes like Sidekiq or Puma/Unicorn. But it would be fine with single threaded one like Thin or DelayedJob?

Yes, indeed.

It fails no matter what the value of use_schemas right?

No, there is no threading problem with postgres schemas, or multiple databases on the same server with mysql (both termed 'schemas' in Apartment for their similarities). The problem is with the adapters that use establish_connection.

@emilebosch : what I'm suggesting is that this is extremely difficult to do safely without some changes to ActiveRecord connection handling.

@emilebosch
Copy link

@mikecmpbll If we monkey hack establish_connection and make it tenant aware. Would that work?

@mikecmpbll
Copy link
Collaborator

I don't think a monkey patch of Rails will be a suitable solution, not least one that inserts Apartment logic into it.

I'll fork AR and see if I can make any tracks with it this weekend but can't guarantee I'll come up with anything.. heh

@apneadiving
Copy link
Contributor Author

Actually rereading @mikecmpbll explanations, I understand my expectations cant be met because of active record's pooling.
Because a model meant to be used per tenant and a model meant to be used globally share the same pool because they both inherit from ActiveRecord::Base.

What confuses me is that using Postgresql + use_schemas set to false seems to work on my machine. Can I simply deduce multiple databases on the same server with postgresql is a single connection? (same as what @mikecmpbll declared for mysql?)

In order to be able to have tenants on different servers, maybe I could try to have them inherit from a TenantDatabase base class:

class TenantDatabase < ActiveRecord::Base
  self.abstract_class = true
  establish_connection not_sure_what_default_to_have_here
end

If I understand properly, its meant to let us have a separated pool of connections.
This could solve:

  • be able to have a connection to global database + to one other tenant in a single threaded environment

It wouldnt solve:

  • be able to have a connection to global database + to several other tenants in a single threaded environment
  • be able to have a connection to global database + to one or more tenants in a multi threaded environment

@mikecmpbll
Copy link
Collaborator

What confuses me is that using Postgresql + use_schemas set to false seems to work on my machine. Can I simply deduce multiple databases on the same server with postgresql is a single connection? (same as what @mikecmpbll declared for mysql?)

Have you tried the test in #180? The multiple database model (use_schemas = false) for Postgresql uses establish_connection so it should be susceptible.

You could split different models between different servers using the abstract class technique, but I don't see how you could do tenant sharding.

@apneadiving
Copy link
Contributor Author

@mikecmpbll My assumption was: an abstract class would have its own connection. So I'd have Global connection and Tenant connection.
From this, I deduced we could have single threaded tenant sharding, changing the Tenant connection whenever needed.

Maybe I've got it wrong though :)

@apneadiving
Copy link
Contributor Author

@bradrobertson
I just figured this: https://github.com/apneadiving/apartment/blob/development/lib/apartment/adapters/abstract_adapter.rb#L125

Because of it, all excluded models have a different connection. Doesn't it somehow prevent from pooling? I feel like its not that desirable, why did you make this choice please?

@bradrobertson
Copy link
Contributor

That doesn't prevent pooling no. Each establish connection will establish a pool of connections.

Again we don't use connection based adapters at all so I can't really comment much here. Our excluded models are maintained in a different schema using the same connection.

Sorry guys I don't really have the time to support you on this one. If you come up with something I'm happy to review it but I'm otherwise too busy to keep fielding these questions.

@apneadiving
Copy link
Contributor Author

@bradrobertson no problem I completely understand and am already happy to find a solid base to work on.

I'll try to organize the excluded under a same parent and the tenant based under another parent: maintaining one single connection for each. But you'll have the time to check in a pull request to come (I just hope I can do it without changing too much)

@bradrobertson
Copy link
Contributor

ya I can review PRs

@apneadiving
Copy link
Contributor Author

For history's sake, each call to establish_connection seems to be meant to create a new pool rails/rails#7019 (comment).

I havent found a more recent resource but it feels like the principle will remain. So I think we'd rather avoid this.

@mikecmpbll
Copy link
Collaborator

@apneadiving yep, I touched on that previously.

there's nothing particularly wrong or bothersome with creating a new connection pool for each excluded model and it would introduce some quite considerable complexity to do it any other way; I don't see it providing any great value.

@apneadiving
Copy link
Contributor Author

@mikecmpbll ah, interesting, you dont think there would be benefit in having one abstract superclass for all tenanted models and one for all shared models?

@mikecmpbll
Copy link
Collaborator

I don't see any real terms advantage of that, no.

@apneadiving
Copy link
Contributor Author

@mikecmpbll ok :) I guess I dont understand it properly then. Say if you call establish_connection in x models. Then need data from those in an action, we'd end up with x connections and their pools. My lack of knowledge would lead me to feel like I'd prefer 1 connection with 1 pool. I'm not sure where to read about this.

@mikecmpbll
Copy link
Collaborator

perhaps, but connections aren't very expensive. it might be able to be improved but I don't know whether that would be worth the added complexity. also, it's not a real world problem that anyone has so in terms of priorities it doesn't really feature for me, i'm more interested in addressing this multi-database threading issue.

probably easiest to hit me up on #rubyonrails on freenode if you wanna discuss more (same name) 😊

@mikecmpbll
Copy link
Collaborator

right. thanks @apneadiving for discussing with me on IRC. we had a bit of a breakthrough and it looks like we can solve the threading issue without any changes to AR.

however, to say I'm not a lover of rspec would be an understatement—and this means I can't work out how to create a failing test for #180 :( if someone can help me with writing an rspec test where i can switch between multiple tenants, then I'll be off and running.

@apneadiving
Copy link
Contributor Author

@mikecmpbll here is one directly inspired from #180 https://gist.github.com/apneadiving/6adcb32ce7fec82d0653

It fails with:

 Failure/Error: expect(tenant).to eq Apartment::Tenant.current

   expected: "db2"
        got: "db1"

   (compared using ==)
 # ./spec/integration/thread_spec.rb:32:in `block (3 levels) in get_thread'

I'm sure we can remove some looping to gain time.
It doesnt fail if we have use_schemas set to true

@emilebosch
Copy link

@mikecmpbll Would you be so kind to tell what the revelation/breaktrhough was?

@apneadiving
Copy link
Contributor Author

@emilebosch mike found out that we can actually change the connection handler at will (thus create one per thread => thread safe but more connections).
Relevant AR code: https://github.com/rails/rails/blob/fb03a9ab35ed22e569ec9cef8a50ef72754b5dbe/activerecord/lib/active_record/core.rb#L113-L121

@emilebosch
Copy link

@apneadiving Awesome. Thus making your own connection handler is the way to go i guess? Good work!

@apneadiving
Copy link
Contributor Author

@apneadiving
Copy link
Contributor Author

Could someone help me for a naming question please?

Basically to create another db on another server we must connect to this server first.
But we cant connect to a server, we have to connect to a db on this server (if nothing is passed, the server looks for a db having the user's name but its not a real convenient bet).

I try to name for the config which would provide the db name to use in order to connect to the server before creating a tenant.

@emilebosch
Copy link

Primary? / Landlord? / Metada
Tenant

I recon you’re lookoing for the database where we can collect all shared entities right?


Emile Bosch
[email protected]

On Oct 8, 2015, at 14:33, Benjamin Roth [email protected] wrote:

@apneadiving
Copy link
Contributor Author

@emilebosch No :)

Say you have your localhost and want to create a tenant on a server hosted on amazon.

To create the tenant db on this other server, you must first connect to it but you cant connect directly (thats not the way it works), you have to connect to one of its dbs (for postgresql the default name is often postgres).

@emilebosch
Copy link

Yes, Ok. Got it. The tenant host db basically, we don’t have AWS so i could be totally wrong.

From my perspective what happens is:

  1. connect to a server under a user that can “CREATE DATABASE” (and only create)
  2. Create the tenant database + specific user.

A landlord isn’t that wrong actually since he creates apartments/tenants. :)

Else its more like a jump database/hop?


Emile Bosch
[email protected]

On Oct 8, 2015, at 14:41, Benjamin Roth [email protected] wrote:

@emilebosch https://github.com/emilebosch No :)

Say you have your localhost and want to create a tenant on a server hosted on amazon.

Say you have to create the tenant db on an amazon's server. You must first connect to the amazon's server but you cant connect directly (thats not the way it works), you have to connect to one of its dbs (for postgresql the default name is often postgres).


Reply to this email directly or view it on GitHub #264 (comment).

@apneadiving
Copy link
Contributor Author

FYI the code is done apneadiving@01cf288

I now have to figure out how to test it. Basically its tough to have two different db servers on the same machine. Maybe I'll just test if my options are properly taken into account

bradrobertson added a commit that referenced this issue Nov 26, 2015
Custom connection options for tenants (basically to store on different servers) #264 #269 #270
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants