Preparing for Data Federation in SQL Azure

This blog will help you prepare for an upcoming release of SQL Azure that will offer support for Data Federation.  While no date has been provided for this feature, I was able to test an early preview and compiled a few lessons learned that can be shared publicly. Note however that certain items could not be shared in this blog because they are considered NDA material; as a result, you should expect additional guidance in future posts when the public Beta will be made available.

What is Data Federation?

First and foremost, allow me to explain what Data Federation is in SQL Azure. Many terms have been used, like partitioning, sharding and scale out. Other sharding pattern exist too, which could be described in similar terms; so it is difficult to paint a clear picture of what Data Federation really is without comparing with other partitioning models. (To view a complete description of sharding models, take a look at a white paper I wrote on the topic where with a link on this page:
The simplest model to explain is the linear shard: each database contains all the necessary tables for a given customer (in which case the customer ID would be the dimension). So in a linear shard, you will need 100 databases if you have 100 customers.
Another model is the compressed shard: a container (typically a schema) inside a database contains all the necessary tables for a given customer. So in a compressed shard, each database could contain one or more customers, all logically usually separated by a schema (providing a strong security data enforcement model). As you will see, Data Federation offers an alternative to schema separation.
The scale up model (which is not a shard technically) uses a single database in which customer records are identified by an identifier. Most applications fall in this category today. However the need to design flexible and scalable components in a cloud paradigm makes this design difficult to embrace.
Data Federation, as it turns out, is a mix of a compressed shard and a scale up model as described above, but instead of using a schema to separate records logically, it uses a federation (a new type of container) which is a set of physical databases (each database is called a federation member).  So within each database (a federation member) you can have multiple customers. You are in control of the number of customers you place inside each federation, member and you can either split customers (i.e. create new database containers) or merge them back at a later time (i.e. reducing the number of databases making up your shard).
As a result, Data Federation allows you to start with a single database (similar to a scale up model), and create federations over time (moving into the compressed shard territory). When you create a federation, you specify a range over the customer IDs you would like to split, and SQL Azure will automatically move the records in the appropriate database (federation member). You could then, in theory, design “as usual” using a scale up model and use Data Federation at a later time when performance requirements dictate.
I did say in theory.

Design Considerations

So at this point you are probably wondering… what’s the catch? It’s pretty simple, really. It has to do with the location of the data and performance implications of storing data across physical databases: referential integrity. [Read Cihan’s blog about the limitations on referential integrity here:]
Data Federation is indeed very powerful. It provides a new way to spread data access across databases, hence offering a shared-nothing architecture across federation members hosting your records. In other words: the more members your federation has (i.e. databases) the more performance is at your fingertips.
But in order to harvest this new power, you will need to make certain tradeoffs, one of which being referential integrity.  While foreign key relationships are allowed within a federated member, they are not allowed between members. So if you are not careful, you could duplicate unique IDs in related tables. Also, since you have distributed records, you may need to replicate certain tables on all your federated members, hence creating a case of data proliferation which will not be synchronized across federation members for you.
If you are considering to use Data Federation in the future and would like to take an early start in your design, I will make the following recommendations:
-          The smaller the number of tables, the easier it will be for you to reconcile records and enforce referential integrity in your application
-          Do not assume all existing Referential Integrity (RI) constructs will still exist in a federated database; plan to move certain RI rules in your application
-          It is possible that the more you create federations members, the more expensive your implementation will be (no formal pricing information is available at the time of this writing)
-          Data Federation still uses the scale up model as its core, since each federated table needs to contain the customer ID in question; so record security and access control is similar to a scale up model
-          You will need to leverage parallel processing extensively to query federated members in parallel if you are searching for records across your members, so if you are not familiar with the Task Parallel Library (TPL), take a look. You will need it
In summary, Data Federation is for you if you need to scale your database to dozens of databases or more due to performance and scalability reasons. It goes without saying that the simpler the database, the easier the adoption of Data Federation will be.  Performance and scalability improvements often come with specific tradeoffs;  Data Federation is no exception and as expected the tradeoff is referential integrity (similar to no-sql database systems).

Print | posted @ Saturday, July 23, 2011 10:23 AM

Comments on this entry:

Comments are closed.

Comments have been closed on this topic.