If you work on an application that uses a database, chances are you have to deal with releasing new versions of your software that make changes to it. The SQL language provides comprehensive support for making these types of changes and can access even advanced features of your chosen database platform. Schema changes are made through create and alter statements typically, and data movement is performed using selects and inserts.
When releasing your software initially, deployment is straightforward as there is no existing data to deal with. As users exercise the features in your software, rows of data are added to tables, and future changes require more care to not destroy or make invalid changes to the existing data.
In the past, DBAs or developers with sufficient SQL programming knowledge have written scripts to make the changes necessary to update database assets that have existing data in them, paying special care to typical situations like adding a new NOT NULL column (you need to initialize it with data to enable the constraint) splitting one column into two, or splitting some columns of a large table out into a new detail table.
For years seasoned developers have used the following approach for making changes to the database:
- Add a table with one row that stores the “version” of the database. This data is not really application data per se, but more like metadata that identifies the state the schema is in. This version is usually initialized to the lowest version where development starts, let’s say 126.96.36.199.
- Create SQL scripts when you have changes that check this row. If the version of the database the script is running against is lower than the “version” of your script, make your changes.
- When your script is done making the changes, increment the version number of the database row to its new version (188.8.131.52 for example).
The great thing about this approach is that it supports deploying changes to multiple versions of the same database. If you are “upgrading” version 184.108.40.206 database and your latest version is 220.127.116.11, any scripts that have numbers between these two versions will run in ascending order. If you are “upgrading” version 18.104.22.168 to the 22.214.171.124, scripts that apply to databases at a version prior to 126.96.36.199 will be skipped.
There are two gotchas with this approach:
- You need to test upgrading from any version you have in the field before deploying to that version. So if you are upgrading databases that can vary by 5 versions, you really need to test the upgrade process going from all 5 of these versions to the current version. This is more a consideration than a limitation as you always need to do this when supporting multiple upgrade paths for your software.
- Developers can make mistakes in their SQL script and look for the wrong version, or forget to update the version in the database to current if the operation is successful.
When Ruby on Rails was released, the ActiveRecord team along with David Heinemeier Hansson provided the then-emerging ruby community with a technology called migrations, that provides some extra help on top of this. Basically anytime you want to change the database, you would run a command at your operating system prompt that would generate a new script that’s prepended with a version number greater than the latest script in your source.
An example will help here.
- You run the command “rails generate migration create_users” and a file 00001_create_users.rb is generated. You put code in here to both update, AND rollback changes related to the “users” table for example.
- You run the command “rails generate migration create_roles” and a file 00002_create_roles.rb is generated. Notice the tool recognized your latest script version and created a newer version automatically.
When you want to deploy to a database, you run another command “rake db:migrate” which tells the “rake” (ruby make) build engine to run all of your database migrations against the target database. The migrations engine automatically does the work of checking the target version of the database, running only those scripts that apply, and incrementing the database version to the latest one that succeeded.
This approach solves the problem of developers having to version things manually, and really simplifies deployment to multiple versions of a database. It also allows developers to incrementally make changes needed to support changes they are working on, without stepping on the toes of other developers.
Enter Visual Studio 2010 database projects
With the release of Microsoft Visual Studio 2010, another approach was provided to developers for managing database versions. This approach was made available by Microsoft’s acquisition of DBPro.
The VS 2010 DB project approach is to have a type of project in your solution that contains scripts that can create all the artifacts in your database. There are create scripts for stored procedures, schemas, roles, tables, views etc. However, the tool is sold as not requiring developers to know as much SQL programming, but rather they are provided with a treeview panel in the Visual Studio IDE (referred to as “Schema View” in the documentation). They can interact with this tree to add tables, rename columns, and make other trivial changes via a GUI and these changes are then saved as new SQL scripts in the project.
What happens when you deploy your DB project is that an engine that is part of the build system in Visual Studio does a compare of the target database being deployed to with what a “new” database would look like based on the scripts in your project, and then generates a script to alter it to make it’s structure match the project’s source code. The engine works much like RedGate software’s “SQL compare” tool in that it is fairly intelligent about determining changes in schema and generating appropriate scripts.
At first glance, this seems like a superior solution as it gives point-and-click programmers more productivity, removes version management from the picture, and eliminates the need to manually create alter scripts. In practice however, by itself this approach will not meet the requirements of most deployment cycles.
Microsoft released an ALM rangers guide to using Visual Studio 2010 database projects that is meant to be used as primary guidance for developers, DBAs, and architects looking at how to use best practices around VS 2010 DB projects. Part of this guide talks about “Complex Data Movement”, or what I will refer to here as “changing database assets containing data” because that’s really what they are talking about.
Unfortunately Microsoft’s solution for this “complex” scenario (which is common and regular, in my experience) is to subvert the diffing engine and revert to the use of temporary tables and pre/post build scripts to trick the engine into thinking the schema doesn’t need to be changes while fixing it up afterwards. This issue is described in the ALM rangers guide, and also on Barclay Hill’s blog post here.
Jeremy Elbourn comments on the MSDN forums why this approach actually makes maintaining database changes over time even more difficult than the migration approach in a real world environment. Microsoft also recently announced the availability of database migration support in ASP.NET MVC 4 (but only if you are using Entity Framework as well). These developments leave folks responsible for determining a database change management approach confused as to where the best practices are going with respect to Microsoft’s vision.
It is of my opinion that Visual Studio 2010 database projects should be avoided in favor of a migrations engine for the following reasons:
- The success or failure of employing VS 2010 DB projects in real world, enterprise sized clients has yet to be demonstrated in measurable capacity and the technology is still relatively new. I’ve seen some press releases, but these are marketing announcements with no downloadable artifacts to evaluate. I also have been discussing the tradeoffs with one colleague who is using it on a single application for an enterprise client with many integrated applications.
- I tend to embrace tools that generate code for me or do work automatically only when they are comprehensive, well-understood, and have limited “gotchas”. Schema and data change management is a complex topic and the VS 2010 database project approach leads developers to think the solution is easy, while in practice it forces them to understand how the diffing engine works, the project structure and deployment lifecycle of a DB project build, and how to circumvent the diffing engine to change database assets containing data.
- The ALM guide proposes detecting existing schema state to determine when pre or post deployment scripts need to be run. “If this column exists, run this script”. This is an error prone and ignorant approach. What if version 1 has this column, version 2 does not, and version 3 adds it back in? This kind of check will fail. Ironically the workarounds for this are to come up with a custom versioning and incremental migration strategy for your pre/post build scripts anyway, which is a red flag to me that the design is flawed.
- The best time to write tests for changes being made to a database and reviewing their impact is when making the changes, as the structural impact is fresh in the developers’ mind. Using the diffing tool, the generated alter scripts still need to be reviewed prior to deployment, especially if you don’t have a high coverage functional and acceptance test suite to ensure no breakage was caused by the change. Chances are you have an operations person reviewing the changes before running them on production, and without comprehensive testing you are relying on them to make sure the changes are appropriate. I hope you are working closely with operations during the entire development lifecycle in this situation!
Migrations work for all technologies and are simpler to understand and maintain
If you would like to use migrations today without both adopting ASP.NET MVC 4 and Entity Framework, Thoughtworks created an open source tool “DBDeploy” (with a corresponding .NET version, DBDeploy.NET) that they use with all of their clients and handles this elegantly. The only difference between it and the rails migration approach is that rails migrations use a DSL for making the changes, while DBDeploy uses SQL.
UPDATE (6/29/2012) I now recommend using RoundhousE as it has better support for more databases, uses .NET instead of Java, and gives you dedicated directories for stored procs, functions, and other assets that can get dropped and recreated each time without having existing data come into the picture.