Data Isolation and Sharing in a Multitenant System - Part 1
In the previous post I've shown few of the additional benefits a well encapsulated data access implementation brings. Implementing multitenancy is one of them. This article continues the previous, picks up the multitenancy context from there, and shows how we can address one of the most important design challenges of multitenant applications, which is Data Isolation and Sharing.
The article turned out quite long, so I have structured it in three parts. This first part focuses on describing the two strategies for Data Isolation and Sharing and more important on how to choose between them. The second and the third parts will show the implementation details on for each strategy.
Multitenancy
Before we go into details, lets review a bit what multitenancy is. The wikipedia definitions says:
The term software multitenancy refers to a software architecture in which a single instance of software runs on a server and serves multiple tenants.
A tenant is a group of users who share a common access with specific privileges to the software instance.
With a multitenant architecture, a software application is designed to provide every tenant a dedicated share of the instance.
We can think of a tenant as an organization which is a customer of our application. The users that belong to that organization is the group of users that form that tenant.
We do multitenant systems because they allow for cost savings. If we have one instance of the application for all our customers we may save money on hardware, software licenses and operational costs. Almost any application has a certain overhead on the resources it needs. If we spread it over more customers of the same app instance we can reduce it. There is memory and processing which is usually not used or used only at peak time; the effort to monitor and operate one instance of the app does not grow with its size as it grows with monitoring more instances of the same app; similar, the number of licences (servers and tools) is more dependet on the number of instances rather than on the size of one instance. All these make a good case for multitenancy architectures over multi-instance architectures. A perfect case for SaaS.
The cost savings can be eclipsed by two main things:
- the difficulty of scaling, and
- the difficulty of satisfying each tenant specific needs.
The one application instance should be able to scale when demand grows, when we add more tenants. There are many aspects to consider when designing for scalability, but I won't detail them in this post. For the second, we should aim for having the same functionality for all the tenants. If one client asks for a new feature and we implement it, then all clients will get it. Maybe some will not use it, or we configure the system, so that they don't have access to it, but it will be there. This also points another important aspect in a multitenant design, which is configurability.
Data Isolation and Sharing
At this point, we should have a good understanding of what multitenancy is and when it makes sense to implement a multitenant architecture. Now, lets focus on the Data Isolation and Sharing aspect, and go back to the scenario from the previous post.
To make it simpler, lets take an example. Lets think of an application which is used in clinics for physiotherapy. The application provides functionalities to assist the doctor, the physiotherapist and the patient during the recovering from surgery or injury. The doctor may give prescriptions, and the physiotherapist may compose a set of exercises that the patient should do, and also to monitor the patient progress. Such an application may be used by large hospitals and at the same time by very small physiotherapy clinics. They would all need a similar functionality, so it makes sense to make it a multitenant application, where each clinic or hospital is a tenant. (This is a simplified example of what we do at MIRA)
Here, we'll have data which belongs to each tenant. For example all the patient related data, which includes: the patient profile, the patient file, the progress the patient had during any rehabilitation program at that specific clinic, etc. These are tables like Patient
, PatientFile
, RehabProgram
, PatientHistory
etc. This data needs to be isolated for each tenant. Each clinic has to feel like that it is the only one on the platform, so a doctor at a clinic cannot see the patients of another clinic and so on. At the same time there is also data which should be shared by all tenants. Here we may have diagnostics, a common set of exercises, a common set of tools, etc. These are tables like: Diagnostics
, Exercises
, RehabTools
etc. This is typically system data, which we want to be used by all tenants and we want that any changes of it to be available to all tenants. Usually it is maintained by system administrators, but there are cases when we want that users that belong to any tenant to be able to edit it. For example we might want that any doctor to be able to enrich the diagnostics that the application knows.
We'll find the same in any multitenant application. We're always have:
- tenant specific data that needs to isolated for each tenant, and
- tenant shared data which needs to be shared between tenants
Another good case for multitenancy are applications that are used in the insurance field. By hosting on the same application instance, large insurance companies with the very small ones we could reduce the operational costs by sharing the resources. Tenant specific data would be the: insurance products, the policies which belong to each insurance company (tenant) etc. and tenant shared data would be the regulations data or the actuarial tables (which are used by all insurance companies to calculate the premiums), etc.
Separate Databases or Shared Database
Once we've identified the tenant specific data and the tenant shared data, the next step would be to decide between the two main strategies for storing data:
- Separate Databases one for each tenant, or
- a Shared Database used by all tenants
Both of the above strategies have pluses and minuses, and there are many tradeoffs to consider when making this choice.
Having a Separate Database for each tenant is easier to implement, especially in the case when we add multitenancy at a later stage of the project (after part of the functionality was already implemented as if this would not be a multitenant app). In this case each tenant will have access only to its own database and the data schema should not be changed from the non-mutitenant version.
This is also simple for isolating the tenants data. Based on the current user, we know the tenant it belongs to and we'll connect to that tenant database. So there no risk that we'll mix multiple tenants data. (Of course if we have caching at higher levels those should also be tenant aware, but that's another topic.)
To maximize the cost benefits of the multitenancy app, we should keep identical schemas for each database of each tenant. This means that when we make a schema change (because we add a column for a new feature) this change needs to be applied to all the databases. Therefore, we'll have the same database version (in fact the same app version) for all the tenants. This means that our automated deployment tool will need to upgrade the schemas of all the databases when we deploy a new version in testing, acceptance or production environments.
For the tenant shared data one approach is to duplicate it in the databases of each tenant. This means that we need to synchronize its changes on all the databases. If it is only system data, then this may be done in the same process with updating the databases schemas from version to version. Otherwise, if users can change it (operational data), then we need to put a sync mechanism in place. In most of the cases it does not need to be a real-time sync, because changes done by one tenant, on the shared data, do not need to be instantaneusly available to the others.
Another approach is to have a separate database for the tenant shared data only. So for a tenant we will connect to its own database for its specific data and to the tenant common database for the shared data. We keep the simplicity on isolating the tenant data and we don't need to do any data sync. However, we would need to take out the tenant shared data tables from the tenant specific database and use some GUID as IDs to link to it. This approach may make even more sense if we have a large set of tenant shared data, which frequently changes and it which is not very interconnected with the tenant data. With this approach we could even consider a non-relational database for the tenant common database.
The other strategy, a Shared Database used by all tenants, means that we keep data from all the tenants in the same database. To isolate tenant specific data, we will have to add a discriminator column like TenantID
to every table which is tenant specific, and to make sure that all the queries and commands will filter the data based on it.
With this strategy dealing with tenant shared data is simple, we just don't filter it. Isolating data is what we need to deal with. For this we need to make sure that ALL the queries and the commands that deal with tenant specific data get filtered by the TenantID
. Here, having a well encapsulated data access through which all the queries and command go can play a key role in assuring this. The data access would be the place where we intercept each query and command, we determine if it deals with tenant specific tables, and if yes we append a WHERE
clause that filters it based on the current tenant. With Linq we would like to add something like:
int currentTenantId = GetCurrentUserTenanatId();
IQueryable<PatientFile> originalQuery
.Where(p => p.TenanatID == currentTenantId);
In the second part of this article (next post), I will detail this code and show how we can make it generic as part of the data access implementation.
This strategy is easier to implement if we know from the beginning that we should build a multitenant system, because we will consider it while designing the database and it will lead to a better design. Also it will be easier to set and maintain the conventions based on which we can implement the generic code that appends the tenant filter. This doesn't mean that we cannot add it at a later stage of the project, IF we can rely on a well encapsulated data access and if we have a good database design in place. After we decide which is tenant specific data and which is tenant shared data, we can build some SQL scripts that add the TenantID
column. This database refactoring may be more difficult if we're already in production, because we might also need to migrate production data to the new schema of the database.
Both these strategies have pluses and minuses and the choice which one is better depends on many aspects including the application field and business, the moment when multitenancy should be added and also on the advantages we want to achieve from the multitenancy model. Below tables illustrate a summary of the benefits and the liabilities of these two main strategies for isolating and sharing data in a multitenant system:
Separate Databases
Benefits | Liabilities |
---|---|
Easy to implement (especialy at a later stage) High data isolation |
Tenant shared data is duplicated accros tenant databases Higher costs for maintaining the tenanat shared data Higher operational costs due to more databases Higher costs for deploying new versions |
Shared Database
Benefits | Liabilities |
---|---|
Higher level of resurce sharing, which may lead to lower costs Development, maintainability and operational costs do not depend on the number of tenants Tenant shared data is easy to maintain |
Low data isolation - requires a data access layer that intercepts all queries and commands Monitoring tenanat data activity is a challenge Backup and restore a single tenanat data requires a custom solution |
The tenant shared data may play an important role on deciding between these two strategies. For example, an application where the tenant shared data is operational data (users from all tenants may modify it), and it represents a significant part from the entire application data model, and it is highly connected (related) to the tenant specific data, it makes a good case for the Shared Database strategy. In the same example, where the tenant shared data is operational data, is big, but it is not very related to the tenant specific data we could go for the Separate Databases strategy, even more if we're adding multitenancy later or if we have very few tenants with large data sets or if we cannot assure data isolation through other means.
There are many other metrics we should measure or estimate before making a decision on one of these strategies. Here are a few:
- the number of tenants
- the number of users per tenant
- amount of data or data transactions per tenant
- frequency of adding tenants or removing (disabling) tenants
- the ratio between small and big tenants
- the frequency of database schema changes
For example a scenario with many small tenants where we want to have the flexibility to add or disable tenants often, would make a good case for the Shared Database strategy. On the other hand the scenario with a fairly fixed number of tenants, which are big (in data amount and data transactions) and similar in size makes a better case for the Separate Databases strategy.
The development model may also play a role in making this decision. For example if we want to go very fast into production, with minimum business functionality and then increase the functionality by doing Continuous Delivery the Shared Database strategy may be a better choice because we'll need to change the database schema quite often. On the other hand if we go into production after most of the functionality is done, and we don't foresee many changes to the database schema Separate Databases strategy may be a good choice because the liability of high costs on updating more databases schemas won't be paid that often.
So, if we put some numbers on these metrics and we balance the benefits and the liabilities of these two strategies we could make a good decision in our context. Even more, we could end up with a strategy which is a hybrid between these two if we have external constraints and we try to take benefits from both or to minimize the liabilities of one of them. For example, in a scenario with one big tenant and many other small tenants we might have one database for the big tenant and one shared for the small ones. With this separation we may target benefits in planning the activities for the backup and restore, or the schema changes for the big fish from the small ones. Another scenario which may lead to such an hybrid is when we go with the Shared Database and at a certain point we need to scale at the database level. We can create a new database and distribute the tenants among them.
Closing
At this point we should have a good image on the strategies for the Data Isolation and Sharing in a multitenancy application and how to choose the one that suits best a given context. In the second part of the article (the next post) we will go in details on how to implement each of them, by looking at some code snippets that give an implementation direction based on a well encapsulated data access.