Thursday, May 29, 2008

This Week in Bazaar

This is the fourth in a series of posts about current topics in the development world of the Bazaar distributed version control system. The series is co-authored by John Arbash Meinel, one of the primary developers on Bazaar, and Elliot Murphy, imaginary boy and part-time impostor.


Stacked branches

Some projects are very big with lots of files and lots of history. Many projects want to maintain the policy that development is done on independent branches, which are then merged back when the development is complete. However, the overhead of downloading, branching, and uploading the full history is prohibitive. There are a couple of different ways to solve this problem.

Dealing with a large branch can be split into two problems: downloading and uploading.

Bazaar has had a storage optimization called shared repositories for quite a while. This serves to dramatically reduce the amount of data downloaded for the second, third, etc branches of a project. A shared repository is a big pool of revisions which multiple branches point to. When you grab a new branch into a shared repository, bzr figures out how much of the history it already has, and only downloads the new revisions. So the first branch of a large project transfers most of the data, and grabbing additional branches is very cheap. In extreme cases, like working on a multi-gigabyte project from a 56k dial-up connection, you could even do things like distribute the initial data on a DVD to prime the shared repository, and then the user only needs to download incremental changes.

This technique can also be used for solving the uploading problem. If the upload location uses a shared repository, then uploading a new branch can just copy the new data. The problem with this, is once you start introducing multiple users, who decide that they may not want to give access to other people to push data into their repository.

Another approach to minimizing the data uploaded is called server side forking, and you can see a nice implementation of this on github.com. The user places a request with the code host to do the copy for them, and when it finishes, they have their own location already primed with the current branch.

The Bazaar project is approaching it in a different way. If some data is already public, then you can just reference the other public location when you start uploading your new branch. The first steps in this direction are being termed "Stacked Branches". Basically, instead of requiring all branches to contain the full history, you are allowed to "Stack" a branch on top of another. Because the uploader does not have write access to the lower levels of the stack, this addresses the security risks of shared repositories.

Stacking also opens up possibilites for the "download" side of the equation. For many users, they don't need a very deep copy of history to get their work done. If there is a location that can be trusted to be available when they need it, they can copy just the tip revisions. Which would allow them to do most of their work (commit, diff, etc) without consulting the remote host. And when they need more information (such as during a complex merge), the bzr client is able to fall back to the original source to get any needed information.

The goal of all this is to make it very easy to start working with a large project, while still making all the history available in a meaningful way. The bulk of this work has been completed, and it is likely that it will land in bzr 1.6 (to be released in a couple of weeks.)

4 comments:

alexweej said...

I am a Git user right now, but this kind of functionality is EXACTLY what I would expect of a DRCS. Being disappointed that Git's shallow clones are so stillborn, I can see myself checking out(!) Bazaar very soon!

Keep up the good work. You guys are rocking.

Reid K said...

This could a lot for getting Bazaar adopted by large corporations. I'm working for one at the moment, and we store all code in one giant repository so that all of the changes go live to the other engineers. However, that means the history includes all of everything going back about a decade, and it's gigs of data. This would make bzr the only viable DVCS choice if we were to ever switch SCMs.

Anonymous said...

@alex

It will be tough to give up the good performance of Git if you move to Bazaar.

Archimedes Trajano said...

I've got to admit that using Git is significantly faster than Bazaar (or any version control system). However, I stuck with Bazaar as it adapts to my needs a bit better than Git does.

My main source of complaint is Bzr+Svn still cannot handle importing an existing Curam project. Which is why I haven't used it for work. (Although I haven't tried to do it with a new project).