Data Segregation in a multi-tenant system

God that’s a sexy title!  Hahah…

This post is all about keeping customer data out of the hands of other customers.  There are many options that can be implemented.  I found three likely candidates and chose one.  Here’s the one I chose and why I chose it.

Here’s the setup… we have several product offerings where customers house their data in our systems.  The problem is that some data in the system is shared while other data is only for the specific customer.  For example, we wouldn’t want customer A to ever be able to see any data that belongs to customer B (or C or D….).  This is a very simple problem to understand.

Just like in most problems with programming, there any many different ways to solve the issue at hand.  Here are two options I *didn’t* choose.

Option 1:  Different Databases

The easiest way to segregate data is to create different database instances.  This would provide 100% data segregation.  Unless customer A has the connection string for customer B’s database, they won’t get their data.  If customer A *does* have customer B’s connection string, they won’t get their own data.  Very secure.  Another plus for this scheme is that if customer A loads the database with a ton of data, customer B’s performance won’t be impacted.

The problem is the sharing of data that everyone is supposed to get.  This would mean another “shared” database that would contain all of the shared data.  Then separate database calls would be made to collect that shared data.  More database calls means worse performance.

Another problem is cost.  Maintaining several different databases is costly.  Keeping all of the stored procedures and functions and table definitions all in sync is difficult.    Hosting several different databases is costly.  I don’t know how much it costs to host a database server at Amazon AWS, but it’s not cheap… well, not as cheap as another option.

Option 2:  Owner Key Columns

Another option is to add a column to every table.  Instead of having different databases, we have a traditional setup with a single database.  Every table will have at least one extra column in it to indicate who “owns” that record.  Then, when querying the database, that column is used to only return the records where the user making the query has rights.

Unlike option 1, costs for this method are much lower.  There’s only one database instance to maintain.  There’s only one schema to keep up to date.  You won’t have to make sure that all of the table definitions and stored procedures are in sync because there’s only one.  Plus, accessing that shared data is a snap.  There’s only one connection string to use so making the connection is simple.

This method has it’s own set of problems though.  First is performance.  If customer A loads the database up with a ton of data, performance for all customers will be impacted.  Second is security.  With all of the data housed together, there’s a possibility that someone might be able to get at someone else’s data.  This can be mitigated with proper use of the owner column but that leads to the third issue… complexity.  Writing the stored procedures and views properly becomes very complex.  If one little thing isn’t right, someone will get someone else’s data or might not get the data they’re supposed to.

Option 3:  Different Schemas

So how to we limit cost and complexity while still maintaining performance and data security?  The option we went with is to have a single database server but several different schemas within it.

Basically,  there is one master schema.  This is what the system connects to.  This is where all of the shared data is stored.  This is *also* where all of the stored procedures and functions are stored.  Customer data is all housed in separate schemas so there’s no chance of cross pollination of data.  How does this work?

Every stored procedure call includes an access key parameter.  This access key is the user id that belongs to the user making the call.  User 1 from customer A has an access key.  User 2 from customer A has a different access key.  When either one of these two users makes a call to a stored procedure their access keys are passed in as parameters.  The stored procedure looks up which customer schema the user should actually query.  It then constructs the query to access the proper schema and executes it.  In the event the stored procedure needs any of that shared data, we’re already in the schema where the shared data lives so it’s available to anyone.

What does this look like in practice?  Consider the following database.  We have three schemas; a “control” schema and two “tenant” schemas.

Screen Shot 2015-03-18 at 2.11.41 PM

When a user from Tenant 1 wants their list of users, we need to do the following.

SELECT * FROM DemoTenant1.Users;

The stored procedure where this actually happens is in the DemoControl schema.  Using the SchemaAccess table, we can look up the schema the user has access to based on the access key they pass in.  Our “GetUsers” stored procedure looks like this:

Screen Shot 2015-03-18 at 2.15.06 PM

Notice how we are building our query dynamically?  We do this so we can specify the schema from the value returned from the “SchemaFor” function.  It takes the access key passed in and looks up the schema name.

Using this mechanism, we can be assured that someone without an access key to DemoTenant1 will not be able to get that data.

Scroll To Top