Thursday, October 15, 2009

The Joys of multiple releases

I had originally written a longer post over at wordpress, only to have Firefox crash while trying to move an image, and WP doesn't do auto-saving like blogger. So now I'm back...

Bazaar 2.0.1 and 2.1.0b1 have now 'gone gold' in that I've uploaded the official tarballs, and asked people to make installers for them. Once installers are made, then we'll make the official announcement.

For those who haven't been following, Bazaar has now split its releases into 2 series. The 2.0.x series is based on 2.0.0 and has only bugfixes. Things that could cause compatibility problems (new features, removal of deprecated code, etc.) is only done in the 2.1.0.x series. We're hoping that this can give people some flexibility, as well as giving us more flexibility. In the past, we've suffered a bit trying to maintain backwards compatibility for some features/bugfixes, only to break compatibility for a big feature. Instead of suffering the worst of both, we're trying to get the best of both. If something needs to break compatibility, it just goes in the dev branch. Note that the development branch is still considered 'stable', in that the test suite always passes, and the code is pretty much always ready for a release. We just don't make the same guarantees about stable internal apis for 3rd parties to use.

The other change to the process is to stop doing as many "release candidate" builds. Instead, we will just cut a release. If there are problems, we'll cut the next release sooner. The chance for regressions in the 'bugfix-only' 2.0.x series should be low, and getting away from pre-builds means less overhead. We will still be doing releases we call 'rc1' before the next major stable release (2.1.0), and in that vein we expect to do little-to-no changes from the rc1 to the final build.

However, this new system does increase overhead for a single release. As now it is equivalent to doing the rc and the final in the same day. Also, because we now have 2 "integration" branches, it requires a bit more coordination between them.

For example, this is the revision graph for the recent 2.0.1 and 2.1.0b1 release

The basic workflow that I used was something like
  1. Have a LOSA create 2 release branches lp:~bzr-pqm/bzr/2.0.1 and lp:~bzr-pqm/bzr/2.1.0b1
  2. Create a local branch of each
  3. Create another branch for doing my updates in, such as lp:~jameinel/bzr/2.0.1
  4. Update 2.0.1 with a new version string
  5. Update NEWS to clean it up, show that there is an official release, and provide a summary/overview of the changes.
  6. Land this update into the official 2.0.1 branch via PQM. (Unfortunately this can take up to 2 hours depending on a bunch of different factors. We are trying to get this down to more like 10 min.)
  7. Update my local copy from the final release. Tag it (bzr-2.0.1).
  8. Create the tarball
  9. Create the release launchpad
  10. Upload the tarball to the release
  11. While this is going on, go through the bugtracker and make sure that things mentioned in NEWS have the appropriate "Fix Released" state in the bug tracker, as well as being associated with the right milestones. With 34 bugfixes, this is a non-trivial undertaking.
  12. Merge the 2.0.1 final release into the 2.1.0b1 branch. (All bugfixes in the stable series are candidates for merging at any time into the development series.)
  13. Do lots of cleanup in NEWS. The main difficulty here is that bugfixes are present on 2 integration branches simultaneously, and those releases are slightly independent. We've talked about having the bugfix mentioned in both sections. Which would be more important if we ever make a development release without doing the corresponding stable release.
  14. Do steps 4-10 again for 2.1.0b1.
  15. While working or waiting on that, prepare lp:~bzr-pqm/bzr/2.0 since it is now going to be prepped for 2.0.2. This involves, bumping the version number, updating NEWS with blank entries for the next release (avoids some conflicts for people landing changes in that branch), and submitting all of that back to PQM.
  16. When that has finished, bring the 2.0 stable branch back into bzr.dev. And prepare bzr.dev for 2.1.0b2. (version number bumps, NEWS cleanups, etc.)
  17. In this case, cleaning up NEWS was again a bit of a chore. As now you have a file that should have a blank area for both the 2.1.0b2 changes, but also the 2.0.2 changes. Further, some of the changes that landed in bzr.dev in the mean-time, were not included in the 2.1.0b1 release. So you have to move them up into the new section. Getting NEWS right across 4 branches was quite a bit of work, and probably the hardest part (so far) of the process. Copy & Paste + bzr diff + bzr vimdiff we quite helpful here. Setting the news in bzr.dev to the exact copy from 'bzr-2.1.0b1' and then showing what was removed/added was a nice way to make sure to get everything .
  18. breathe
  19. Announce the tarballs, etc on the bzr mailing list, so that people can start preparing packages/installers.
  20. I'm also the windows installer packager, so I get to build around 8 installers... (standalone installers, and 3 python [2.4, 2.5, 2.6] installers.) Most of this is scripted, but often something breaks along the way.
  21. Update Pypi, freshmeat, ... with updates for the new versions. Twice. (At least here we did not update for 'rc' versions. So this work is strictly doubled.)
Overall, it is a fair amount of work. I think it will amount to 1-2 days of work time (spread out over 3+ days of real time). With any luck, it will amount to being 'more concentrated, but less often'. I would say that we'd get more practice, but we also try to rotate release managers. Both to spread the knowledge, and to avoid burnout. (Though Martin has been doing the last 4-or-so releases...)

So here's to everyone upgrading to their preferred release (in about a week's time).

Thursday, October 8, 2009

Refactoring work for review (and keep your annotations)

Tim Penhey recently had a nice post about how he split up his changes to make it easier to review. His method used 'bzr pipeline' and some combinations of shelving, merging, and reverting the merges.

However, while I wanted to refactor my changes to make it easier to review, I didn't want to lose my annotation history. So I took a different approach.

To start, with, I'll assume you have a single branch with lots of changes, each wrt different features. You developed them 'concurrently' (jumping back and forth between features, without actually committing it to a different branch). And now that you are done, you want to split them out again.

There are a lot of possible ways that you can do this. With some proponents prefering a 'rebase' style. Where you replay the commits you made in a new order, possibly squashing them, etc. I'm personally not a big fan of that.
Tim's is another method, where you just cherrypick the changes into new branches, and use something like bzr-pipeline to manage the layering. However in reading his workflow, he would also lose the history of the individual changes.

So this is my workflow.
  1. Start with a branch that has a whole lot of changes on it, and is essentially 'done'. We'll call this branch "dogpile".
  2. Create a new branch from it (bzr branch --switch ../dogpile ../feature1), and remove all of the changes but the 'first step'. I personally did that with "bzr revert -r submit: file1 file2 file3" but left "file4" alone.
  3. "bzr commit" in that branch. The delta for that revision will show a lot of your hard-worked on changes being removed. However "bzr diff -r submit:" should show a very nice clean patch that only includes the changes for "feature1".
  4. Go back to the original dogpile branch, and create a new "feature2" branch. (bzr branch --switch ../dogpile ../feature2)
  5. Now merge the "feature1" branch (bzr merge ../feature1). At this point, it looks like everything has been removed except for the bits for feature1. However, just using "bzr revert file2..." we can restore the changes for "feature2".
  6. You can track your progress in a few ways. "bzr diff -r submit:" will show you the combine differences from feature1 and feature2. "bzr diff -r -1:../feature1" will show you just the differences between the current feature2 branch and the feature1 branch. The latter is what you want to be cleaning up, so that it includes all of your feature2 changes, built on top of your feature1 changes. You also have the opportunity to tweak the code a bit, and run the test suite to make sure things are working correctly.
  7. "bzr commit" in that branch. At this point, the diff from upstream to feature1 should be clean, and the diff from feature1 => feature2 should be clean. As an added benefit, doing "bzr annotate file2" will preserve all the hard-won history of the file.
  8. repeat steps 4-7 for all the other features you wanted to split out into their own branches.
When you are done, you will have N feature branches, split up from the original "dogpile" branch. By using the "merge + revert things back into existence" trick, you can preserve all of the annotations for your files. This works because you have 2 sources that the file content could come from. One source is the "dogpile" branch, and the other source is a branch where "dogpile" changes were removed. Since the changes are present in one of the parents, the annotations are brought from there.

This is what the 'qlog' of my refactoring looks like.

The actual content changes (the little grey dots) actually span about 83 commits. However, you can see that I split that up into 6 new branches (some more independent than others), all of which generate a neat difference to their parent, and preserve all of the annotation information from the full history. You can also see that now that I have it split out, I can do simple changes to each branch (notice that purple has an extra commit). This will most likely come into play if people ask for any changes during review.

Monday, March 23, 2009

brisbane-core

Jonathan Lange decided to drop some hints about what is going on in Bazaar, and I figured I could give a bit more detail about what is going on. "Brisbane-core" is the code name we have for our next generation repository format, since we started working on it in our November sprint in Brisbane last year.

I'd like to start by saying we are really excited about how things are shaping up. We've been doing focused work on it for at least 6 months now. Some of the details are up on our wiki for those who want to see how it is progressing.

To give the "big picture" overview, there are 2 primary changes in the new repository layout.
  1. Changing how the inventory is serialized. (makes log -v 20x faster)
  2. Changing how data is compressed. (means the repository becomes 2.5:1 smaller, bzr.dev now fits in 25MB down from 100MB, MySQL fits in 170MB down from 500MB)
The effect of these changes is both much less disk space used (which also affects number of bytes transmitted for network operations), and faster delta operations (so things like 'log -v' are now O(logN) rather than O(N), or 20x faster on medium sized trees, probably much faster on large trees).


Inventory Serialization


The inventory is our meta-information about what files are versioned and what state each file is at, (git calls it a 'tree', mercurial calls it the 'changelog'). Before brisbane-core, we treated the inventory as one large (xml) document, and we used the same delta algorithm as user files to shrink it when writing it to the repository. This works ok, but for large repositories, it is effectively a 2-4MB file that changes on every commit. The delta size is small, but the uncompressed size is very large. So to make it store efficiently, you need to store a lot of deltas rather than fulltexts, which causes your delta chain to increase, and makes extracting a given inventory slower. (Under certain pathological conditions, the inventory can actually take up more than 50% of the storage in the repository.)

Just as important as disk consumption, is that when you go to compare two inventories, we would then have to deserialize two large documents into objects, and then compare all of the objects to see what has and has not changed. You can do this in sorted order, so it is O(N) rather than O(N^2) for a general diff, but it still means looking at every item in a tree, so even small changes take a while to compute. Also, just getting a little bit of data out of the tree, meant reading a large file.

So with brisbane-core, we changed the inventory layer a bit. We now store it as a radix tree, mapping between file-id and the actual value for the entry. There were a few possible designs, but we went with this, because we knew we could keep the tree well balanced, even if users decide to do strange things with how they version files. (git, for example, uses directory based splitting. However if you have many files in one dir, then changing one record rewrites entries for all neighbors, or if you have a very deep directory structure, changing something deep has to rewrite all pages up to the root.) This has a few implications.

1) When writing a new inventory, most of the "pages" get to be shared with other inventories that are similar. So while conceptually all information for a given revision is still 4MB, we now share 3.9MB with other revisions. (Conceptually, the total uncompressed data size is now closer to proportional to the total changes, rather than tree size * num revisions.)

2) When comparing two inventories, you can now safely ignore all of those pages that you know are identical. So for two similar revisions, you can find the logical difference between them by looking at data proportional to the difference, rather than the total size of both trees.


Data Compression

At the same time that we were updating the inventory logic, we also wanted to improve our storage efficiency. Right now, we store a line-based delta to the previous text. This works ok, but there are several places where it is inefficient.
  1. To get the most recent text, you have to apply all of the deltas so far. Arguably the recent text is more often accessed than an old text, but it is the slower text to get. To offset this, you can cap the maximum number of deltas before you insert a fulltext. But that also affects your storage efficiency.
  2. Merges are a common issue. As one side of the merge will have N deltas representing the changes made on that side. When you then merge, you end up with yet another copy of those texts. Imagine two branches, each changing 10 lines, when you merge them, if a delta could point at either parent, you could have a copy of 10 lines, but the other 10 lines looks like a new insert. Thought of another way, after a merge you have many lines that have existed in other revisions, but never in the same combination. Comparing against any single text would always be inefficient.
  3. Cross file compression. As a similar issue to the 'single parent' in (2), there are also times when you have texts that don't share a common ancestry, but actually have a lot of lines in common. (Like all of the copyright headers)
There are lots of potential solutions for these, but the one we went with we are calling "groupcompress". The basic idea is that you build up a "group" of texts that you compress together. We start by inserting a fulltext for the most recent version of a file. We then start adding ancestors of the file to the group, generating a delta for the changes. The lines of the delta are then used as part of the source when computing the next delta. Once enough texts have been added to a group, we then pass the whole thing through another compressor (currently zlib, though we are evaluating lzma as a "slower but smaller" alternative).

As an example, say you have 3 texts.
text1:
first line
second line
third line

text2:
first line
modified second line
third line

text3:
first line
remodified second line
third line

So if you insert text3 at the start, when you insert text2 you end up with a delta that inserts "first line" into the stream. When you get to text1, you then can copy the bytes for "first line" from text2, and "third line" from text3.

There are a few ways to look at this. For example, one can consider that the recipe for extracting text1, is approximately the same as if you used a simple delta for text3 => text2, and then another delta for text2 => text1. The primary difference is that the recipe has already combined the two deltas together. The main benefit is that to extract text3, you don't have to create the intermediate text2.

One downside to storing the expanded recipes is that there is some redundancy. Consider that both text2 and text1 will be copying "first line" from text3. In short examples, this isn't a big deal, but if you have 100s of texts in a row, the final recipe will look very similar to the previous one, and they will be copy instructions from a lot of different regions. (Development tends to add lines, so storing things in reverse order means those lines look like deletions. Removing lines splits a copy command into 2 copy commands for the lines before and the lines after.)

A lot of that redundancy is removed by the zlib pass. By doing the delta compression first, you can still get good efficiency from zlib's 32kB window. The other thing we do is analyze the complexity of the recipe. If the recipe starts becoming too involved, we will go ahead and insert a new fulltext, which then becomes a source for all the other texts that follow. There are lots of bits that we can tune here. The most important part right now is making sure that the storage is flexible to allow us to change the compressor in the future, without breaking old clients.


What now?

The branch where the work is being integrated is available from:
bzr branch lp:~bzr/bzr/brisbane-core
cd brisbane-core
make

For now, the new repository format is available as "bzr init-repo --format=gc-chk255-big", but is considered "alpha". Meaning it passes tests, but we reserve the right to change the disk format at will. Our goal is to get the format to "beta" within a month or so. At which point it will land in bzr.dev (and thus the next release) as a "--development" format (also available in the nightly ppa). At that point, we won't guarantee that the format will be supported a year from now, but we guarantee to allow support converting data out of that format. (So if we release --development5, there will be an upgrade path to --development6 if we need to bump the disk format.)

Going further, we are expecting to make it our default format by around June, 2009.

At this point changes are mostly polish and ensuring that all standard use cases do not regress in performance. (Operations that are O(N) are often slower in a split layout, because you spend time bringing in each page. But if you can change them into O(logN) the higher constants don't matter anymore.)

Thursday, August 14, 2008

This Week in Bazaar

Ah, to take a break from reporting to the world, but now we are back. This used to be a completely weekly series of posts about the on-going events in the world of Bazaar (and may be yet again). Written by co-authors John Arbash Meinel, one of the primary developers on Bazaar, and Paul Hummer, who works on integrating Bazaar into Launchpad.


Bazaar 1.6rc3 Released

With Martin Pool going on vacation for the next two weeks, John has stepped up to marshall 1.6 out the door. And he started with not 1 but 2 release candidates in 2 days. We're trying hard to get back into a time-based release schedule. The problem with sneaking in a feature-based release, is that they always end up slipping, as everyone tries to get "one-more-thing" in to the delayed release. However, with RC3, we've actually gotten the list of things that must be in 1.6 down to 0, so there is a very good chance it will become 1.6-final next week.

Since it has been a delayed release, there are lots of goodies inside to partake of. Stacked Branches, improved Weave merge, significantly faster 'bzr log --short', improvements to the Windows installation, better server side hooks, and the list goes on. Most of this we have mentioned in previous "This Weeks", the big difference is that it is available in a release, rather than just in the bzr.dev trunk.

The Windows install is one of the major changes, in that it will now (by default) bundle TortoiseBzr as part of the standalone install. TortoiseBzr still needs work before it is as much of a joy to work with as the rest of the system, but this release is mostly about testing our ability to bundle them together.


Looking forward to Bazaar 1.7

As 1.6 nears it's official release, the development community has started planning the 1.7 development process. As it stands now, bzr 1.7 has a planned release date of September 8th. This means there are two whole weeks two get various bugfixes and contributions to bazaar in before getting down to release time (mentoring available).

Among the proposed potential features, there are a few that really stand out. Mark Hammond has been polishing Bazaar on Windows, and there is much desire for someone to help getting the bazaar test suite to run cleanly in Mac OS X. These features will greatly add to the existing portability strengths of Bazaar. While the majority of changes needed are actually in the test suite, and not the core functionality, the community could really use someone who could step up, and learn how to do unit testing in Python. Bazaar 1.7 will also see some increased merge flexibilities, especially with criss cross merges.

Improvements to the indexing layer are likely to land in 1.7, though as always, not on the default format. (We want at least 1 release supporting a format before we suggest it as the default, to give people time for compatibility.) The new b+tree layout for indexes makes them smaller (by approx 2:1) and makes them faster to search (eg, bzr log file being 3x faster).

We also have a chance to land Group Compress, which has shown to compress repositories by as much as 3:1 over their current size. This change needs a bit more tweaking, though. There are generally tradeoffs between how much time you spend compressing, and how small the result is. And we want to make sure that we make the right tradeoffs. It is currently being evaluated as a test plugin.


Bazaar Bug Day

As Bazaar development speeds up, so do the incoming bugs. There are currently 1062 open bugs in Launchpad, and 287 of them have a "New" status, meaning they have not yet been triaged and categorized. At a past Bazaar sprint, a "bug day" was talked about, and it has been brought up again on the mailing list. Often, we fix many bugs and just haven't gotten around to marking them fixed. This is a great opportunity for members of the community who use Bazaar but don't directly develop it t o contribute back to the Bazaar community. You can help out by verifying bugs have been fixed, that they no longer exist, or that they still exist and provide more information on them. Come give your karma a boost, and help us squish some bugs!

Monday, July 28, 2008

Last Week in Bazaar

Well, I'm late this week, so I'm officially marking this post as Last Week in Bazaar. In my defense, I got busy last Thursday, and then my cohort (Paul Hummer) flew off to New Zealand for a work-related sprint. So today, I (John Arbash Meinel, a developer on Bazaar) get to exercise full control over the content.


Keyword Expansion

People often request the ability to expand keywords, like they are used to in SVN and CVS. We've sort of postponed the implementation, because probably 90% of the time, it isn't really the right solution to the problem users are having. Also, they are kind of a mess in CVS anyway. Where I used to work we tried to use $Id$ style expansion, only to find out that they conflict on every attempt at merging, and we started working hard to strip them out of our files. In a distributed VCS, you usually merge at least an order-of-magnitude more often, which also tends to reveal this problem.

SVN at least works around the problem, in that when you commit, it actually strips the texts of their expanded keywords, so that the repository never stores the expanded form. And merges are also done on the non-expanded form. Which fixes that little problem. Though it introduces a couple others. Specifically, what you have on disk is not what sits in the repository, nor is it exactly what you will get out of a fresh checkout. The biggest reason is that if you commit revno 1354, it will update the tags of files that are touched. But if you checkout revno 1354 it will update the tags of *all* files. (I'm not positive on this, but I know there was a bug which was causing problems for people trying to do conversions. Because they couldn't quite find the right invocation to have 'update -r 1354' (from 1353) give the exact same tree as 'checkout -r 1354').

The other reason keyword expansion is not usually what you want, is because it expands only for the given file. If you make a commit to 5 other files, the *tree* is at revno 1359, but the file with your:

my_version = "$Version$";

Tag is still pointing at 1354. (Again, if 'svn update' would force all the tags to get re-expanded it might work correctly, though you run into performance problems expanding every keyword in every file on every update.) Bazaar has supported the bzr version-info command for a while, which lets you generate a version file (possibly from a template) which can store all the real details. Including the last-modified version for every file, whether any files in the working tree have been modified since commit, etc.

The only case that I've really heard a good reason for keyword expansion is for a Website. Where each individual file is spread out into the world. So having a little "last modified" at the bottom can be pretty convenient. You also don't tend to have a "build" process which lets you generate the version information at that time.

However, as Bazaar is meant to be a flexible system, Ian Clatworthy has done a wonderful job of adding the ability to support content munging via plugins. And has continued on to write a plugin specifically for expanding keywords.

http://bazaar-vcs.org/KeywordExpansion

So for all those people who feel they really need keyword expansion, look it up.I would imagine that once people get a good feel for it, and it matures a bit, it has a good chance to be brought into core code. Or at least make it into the release tarball as an "official" plugin.


Open Source, Python, and Counting My Blessings

Now onto something a bit more personal. This last week I had cause to re-visit an old library I had written, and try to get it up and running again. (Specifically, the project was pydcmtk, python wrappers for the Dicom Toolkit.)
It took me several hours times several days just to get it to build and run the test suite again. All without changing any of the code. It was simply a build-step problem.
Which revealed a couple wonderful things about my current work:

  1. I get to work in Python. Which is a nice language, flexible, and *doesn't* need a build step. Not having to deal with C/C++ and all the complexities of getting dependencies built, with the right version of the compiler, and the right flags to the compiler.
  2. Microsoft has a much harder time on their hands than Open Source does, at least when it comes to compatibility. Specifically, each version of their compiler comes with a different runtime. And code compiled for Visual Studio 7.1 doesn't like to work with the 8.0 objects nor the 9.0 objects. And they all have different msvcrtX.X.dll files. However, because the official method for getting your program to users is in binary (object) form, they have to provide ways to support your binary files for a long time. So in VS 8.0 they introduced a new step, which is to post-process your linked binaries with a manifest, declaring what runtimes they use. Further complicating this is that if you try to run a 8.0 compiled dll, it just gives an opaque "This process has tried to access the runtime incorrectly."
    Not realizing this, I spent a long time comparing the exact compiler flags with other examples to fix it. (The boost build tool, bjam, knows how to do it, but there was a line "if exists foo.manifest: do stuff", which I originally read as "if not exists foo.manifest: create the manifest.")
  3. Open source has generally handled the binary compatibility issue by punting and requesting software compatibility. And then you have a whole bunch of groups that spend their time recompiling everything for you (distributions like Ubuntu or Red Hat). And then they give you all the dependencies with a few simple commands (apt-get install zlib-dev dcmtk-dev boost-dev). On Windows, if you want to switch to developer mode, you generally have to grab the source code for all of those dependencies, and recompile them for your exact configuration.
    Software-level compatibility is *much* easier to handle. Not the least of which because if something becomes incompatible you can fix it. (I remember a Microsoft memory issue, where they had to switch in bug-for-bug compatibility because fixing it broke SimCity, how much better if they could have just patched SimCity.)
    Binary compatibility (for C/C++) means that you can't even add members to structs, because then the size changes and malloc starts failing (plus members are referenced by offset, so adding something in the *middle* is a big no-no).
    Source-level turns this way down into not removing things people are using. And, if something does change, with source-level you can even write a patch to fix the code. This does make it quite a bit harder for people who want to release binary-only packages, that they then don't have to modify for years. (Though when updating is a simple process, people are willing to do it more.)

1.6rc1 soon to come

We are working on putting the final polish to stacked branches. We are trying to release something that people can feel comfortable using right away, and there are a few tricks to get there. (For example, bzr has a general policy of always preserving the source format when you do 'bzr branch'. It helps maintain compatibility within a project that hasn't chosen to upgrade to a newer format yet. However if you do 'bzr branch --stacked' that indicates you want to use the new feature, so we have to work out logic to create an upgraded target at the right time. This also turns out to conflict a bit with bzr-svn, which had its own logic to trick 'bzr branch' into not copying the source format.)

You can already play with the Stacked Branches feature in the beta releases, but they'll appear much more polished in the final rc.

Thursday, July 17, 2008

This Week in Bazaar

Welcome back to the terrarium of the Bazaar distributed version control system. Written by co-authors John Arbash Meinel, one of the primary developers on Bazaar, and Paul Hummer, who works on integrating Bazaar into Launchpad as he refines his plans for world domination from his shiny new lair.

Bazaar 1.6b3 released

The next beta release of Bazaar has just been cut, and is available at your local PPA:
https://launchpad.net/~bzr/+archive

The Windows installers should be available later today. This release provides lots of the shiny things that we've been talking about, like Stacked Branches, Real Weave Merge, more hooks for server-side operation, and lots of bug fixes and general polishing. The full UI for using stacked branches still needs a little bit of polishing, so the feature is not enabled by default. The functionality is all there, and if you are interested, we'd love to hear from you (kudos and complaints are equally welcome).


New updates to Gnome Bazaar Playground

Coming back from a very productive trip to Guadec, Tim Penhey has been overseeing some customizations to the Bazaar Playground for Gnome. All of the branches created at the local server in Turkey for Guadec have been added to the public playground. The Loggerhead installation has received some TLC by way of customizations to the UI. Accerciser's playground page is a good demonstration af the UI changes that have been made. The playground is actively being used by applications such as Brasero, jhbuild, Metacity and more.

One of the fun results of meeting with people at Guadec, is that it showed ways to improve Loggerhead when dealing with lots of projects and lots of branches. Work is continuing to make customizing Loggerhead's look-and-feel easier, and providing better tools for creating these "Bazaar Playgrounds" to use in evaluating Bazaar. The Bazaar developers are committed with making tools easier to use, and making the process as simple and powerful as possible.

Up and Coming Repository Format Updates

Robert Collins has been hard at work to refine how Bazaar stores its history information. We all like to have deep context, but we don't like to have to pay the penalty of downloading all of that context. Because Bazaar has a flexible repository structure, Robert has been able to play with changing the on-disk structure without major surgery to the rest of the code.

First is a change to how indexes are written, switching from a bisectable list to a btree structure. This paged structure allows us to compress the indexes, making them smaller, and faster to process remotely. It also reduces the number of lookups to find a key. (On average, a bisect search is log2N, while the btree is closer to log100N.) At the moment, he is testing this with a shared repository containing all of the projects available from in the Ubuntu apt repositories. This weighs in at around 13k branches, and somewhere around 20GB of disk space used.

Second is an update to how texts are stored. At the moment we use a simple format which places fulltexts periodically, and then stores deltas against those fulltexts. It has served us rather well, but can be improved upon. With his Group compress work, we can see a savings of as much as 2x-3x. Further, the data is stored such that you can do simple linear reads to get the base fulltext and all deltas necessary to generate a given fulltext. This reduces the pressure on indices, as you don't have to search for base texts. (Instead you just store a pointer to the start, and give the total length that needs to be read.)

These are still in development phase, but a format that uses them will likely appear in the next release (bzr 1.7).

Community Agile

Ian Clatworthy has recently released a wonderful document describing the workflow we (generally) use at Canonical. It describes how basic practices are similar to, and different from, other systems like Agile. The biggest (IMO) being a recognition that the community surrounding your project is one of the strongest and most important pieces. This has always been true in software development, but it has traditionally been somewhat hidden. Open Source has exposed just how powerful the community can be. For people interested in how software can be developed, rather than just what, I certainly recommend it.

Thursday, July 10, 2008

This Week in Bazaar

Here we are again, bringing you the gossip and dirty secrets in the development world of the Bazaar distributed version control system. In this, the 10th week, the series is now under new management, with co-authors John Arbash Meinel, one of the primary developers on Bazaar, and Paul Hummer, who works on integrating Bazaar into Launchpad.


Bundle Buggy

Aaron Bentley has once again been improving his wonderful Bundle Buggy. He just introduced support for multiple projects using a single instance of Bundle Buggy. There are now 5 Bazaar projects using the main bundle buggy instance. (Bazaar, bzr-gtk, Bundle Buggy itself, Bzrtools, and PQM.) Of course, Daniel Watkins has made excellent use of his time, and has managed to crank out lots of updates for PQM. At this point it is code clean up, reducing the dependencies making it easier to set up and install.

Bazaar playground for Gnome


Originally, John Carr set up Bazaar mirrors of all the Gnome modules, which people could then use as a starting point for publishing code and collaborating. This week, the Bazaar playground for gnome was created so that any Gnome developers could be involved in pushing, branching, and sharing code through bazaar. This new server runs Loggerhead for viewing the code committed to these Bazaar branches. Damned Lies is also set up on the playground. This server was also reproduced locally at GUADEC because of the flaky internet connection at the conference, and all those local branches will be moved to the playground shortly.


Weave merging and handling "interesting" history

One of the great things about having a large project like MySQL using your software is that they push and stretch you in ways that you haven't necessarily encountered before. Specifically, their branch workflow looks a bit like a pile of spaghetti. With several long-term maintenance branches, team branches based off of that, and individual developer branches based off of that. Patches have a tendency to travel in unexpected ways (you may go user => team => release 1 => release 2, or you might go release 1 => team => team-2 => release 2, etc). They also are very fond of 'null merging' patches that aren't relevant to the next release. They merge the change and revert the text changes and commit.

Bazaar supports all of this, but it exposes weaknesses in simple 3-way merge logic. Because patches don't flow in anything considered orderly, you don't have the opportunity to select a "clean" base very often. Bazaar has long had an option for doing a "--weave" merge. It didn't receive much attention for a while, and had become rather slow. It turned out to be a good fit for MySQL's workflow, so John has spent a bit of time recently to make the functionality efficient and correct in some specific edge cases. Expect the improvements to show up in the next release.