Monday, March 23, 2009

brisbane-core

Jonathan Lange decided to drop some hints about what is going on in Bazaar, and I figured I could give a bit more detail. "Brisbane-core" is the code name we have for our next-generation repository format, since we started working on it at our November sprint in Brisbane last year.

I'd like to start by saying we are really excited about how things are shaping up. We've been doing focused work on it for at least 6 months now. Some of the details are up on our wiki for those who want to see how it is progressing.

To give the "big picture" overview, there are 2 primary changes in the new repository layout.
  1. Changing how the inventory is serialized. (makes log -v 20x faster)
  2. Changing how data is compressed. (the repository shrinks by roughly 2.5:1; bzr.dev now fits in 25MB, down from 100MB, and MySQL fits in 170MB, down from 500MB)
The effect of these changes is both much less disk space used (which also reduces the number of bytes transmitted for network operations) and faster delta operations: things like 'log -v' are now O(logN) rather than O(N), which works out to 20x faster on medium-sized trees, and probably much more on large trees.


Inventory Serialization


The inventory is our meta-information about what files are versioned and what state each file is at (git calls it a 'tree', mercurial calls it the 'manifest'). Before brisbane-core, we treated the inventory as one large (xml) document, and we used the same delta algorithm as for user files to shrink it when writing it to the repository. This works ok, but for large repositories, it is effectively a 2-4MB file that changes on every commit. The delta size is small, but the uncompressed size is very large. So to store it efficiently, you need to store a lot of deltas rather than fulltexts, which causes your delta chain to grow, and makes extracting a given inventory slower. (Under certain pathological conditions, the inventory can actually take up more than 50% of the storage in the repository.)

Just as important as disk consumption: when comparing two inventories, we had to deserialize two large documents into objects, and then compare all of the objects to see what had and had not changed. You can do this in sorted order, so it is O(N) rather than O(N^2) for a general diff, but it still means looking at every item in the tree, so even small changes take a while to compute. Also, just getting a little bit of data out of the tree meant reading a large file.
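To make that cost concrete, here is a rough sketch (not bzr's actual code; the flat dict-of-entries representation is an assumption for illustration) of what comparing two whole inventories looks like when each one is a single monolithic document: even a one-file change forces a walk over every entry in both trees.

# Rough sketch, not bzr's real API: diffing two "monolithic" inventories.
# Each inventory is assumed to be a dict of file_id -> (path, text_sha1).

def naive_inventory_diff(old_inv, new_inv):
    """Walk both inventories in sorted order; O(N) in total tree size."""
    changes = []
    for file_id in sorted(set(old_inv) | set(new_inv)):
        old_entry = old_inv.get(file_id)
        new_entry = new_inv.get(file_id)
        if old_entry != new_entry:
            changes.append((file_id, old_entry, new_entry))
    # Even a one-line change to one file touches every file_id above.
    return changes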

So with brisbane-core, we changed the inventory layer a bit. We now store it as a radix tree, mapping from file-id to the actual value for the entry. There were a few possible designs, but we went with this one because we knew we could keep the tree well balanced, even if users decide to do strange things with how they version files. (git, for example, uses directory-based splitting. However, if you have many files in one dir, then changing one record rewrites entries for all its neighbors, and if you have a very deep directory structure, changing something deep has to rewrite all pages up to the root.) This has a few implications, and there is a rough sketch of the approach after the two points below.

1) When writing a new inventory, most of the "pages" get to be shared with other inventories that are similar. So while conceptually all the information for a given revision is still 4MB, we now share 3.9MB of it with other revisions. (Conceptually, the total uncompressed data size is now roughly proportional to the total changes, rather than to tree size * number of revisions.)

2) When comparing two inventories, you can now safely ignore all of those pages that you know are identical. So for two similar revisions, you can find the logical difference between them by looking at data proportional to the difference, rather than the total size of both trees.
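Here is a rough sketch of the approach and of both points above, assuming a simplified prefix-split map (this is not bzr's real code; the page-size limit and hashing are invented for illustration): pages are split by file-id prefix, an unchanged page keeps the same content key across revisions, and a comparison only looks inside the pages whose keys differ.

import hashlib

PAGE_LIMIT = 4  # real pages hold far more entries; kept tiny for the sketch

def split_into_pages(entries, prefix=""):
    """Split {file_id: value} into pages keyed by file-id prefix."""
    if len(entries) <= PAGE_LIMIT:
        return {prefix: dict(entries)}
    buckets = {}
    for file_id, value in entries.items():
        next_char = file_id[len(prefix):len(prefix) + 1]
        buckets.setdefault(next_char, {})[file_id] = value
    pages = {}
    for ch, bucket in buckets.items():
        pages.update(split_into_pages(bucket, prefix + ch))
    return pages

def page_key(page):
    """Content hash of a page; unchanged pages keep the same key."""
    return hashlib.sha1(repr(sorted(page.items())).encode()).hexdigest()

def paged_inventory_diff(old_pages, new_pages):
    """Compare two inventories stored as {page_key: page} mappings.

    Pages present under the same key are identical by construction, so we
    only look inside the pages that actually differ."""
    changed_old = [p for k, p in old_pages.items() if k not in new_pages]
    changed_new = [p for k, p in new_pages.items() if k not in old_pages]
    old_entries = {}
    for page in changed_old:
        old_entries.update(page)
    new_entries = {}
    for page in changed_new:
        new_entries.update(page)
    return [(fid, old_entries.get(fid), new_entries.get(fid))
            for fid in set(old_entries) | set(new_entries)
            if old_entries.get(fid) != new_entries.get(fid)]

With this layout, two similar revisions share every page except the handful along the changed entries' keys, which is what keeps both the storage and the comparison proportional to the change rather than to the tree size.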


Data Compression

At the same time that we were updating the inventory logic, we also wanted to improve our storage efficiency. Right now, we store a line-based delta to the previous text. This works ok, but there are several places where it is inefficient.
  1. To get the most recent text, you have to apply all of the deltas so far. Arguably the most recent text is accessed more often than old texts, yet it is the slowest one to extract. To offset this, you can cap the maximum number of deltas before you insert a fulltext, but that also hurts your storage efficiency.
  2. Merges are a common issue, as one side of the merge will have N deltas representing the changes made on that side. When you then merge, you end up with yet another copy of those texts. Imagine two branches, each changing 10 lines: a delta can only point at one parent, so you can copy that parent's 10 lines, but the other 10 lines look like a new insert. Thought of another way, after a merge you have many lines that have existed in other revisions, but never in the same combination, so comparing against any single text will always be inefficient.
  3. Cross-file compression. Similar to the 'single parent' issue in (2), there are also times when you have texts that don't share a common ancestry, but actually have a lot of lines in common (like all of the copyright headers).
There are lots of potential solutions for these, but the one we went with is what we are calling "groupcompress". The basic idea is that you build up a "group" of texts that you compress together. We start by inserting a fulltext for the most recent version of a file. We then start adding ancestors of the file to the group, generating a delta for each one. The lines of each delta are then used as part of the source when computing the next delta. Once enough texts have been added to a group, we pass the whole thing through another compressor (currently zlib, though we are evaluating lzma as a "slower but smaller" alternative).
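A minimal sketch of that flow, using a toy delta built with Python's difflib (bzr's real groupcompress matcher and wire format are quite different): the newest text goes in as a fulltext, each older text becomes copy/insert instructions against everything already in the group, and the finished group is handed to zlib.

import zlib
from difflib import SequenceMatcher

def make_group(texts):
    """texts: list of byte strings, newest first. Returns (recipes, blob)."""
    group = b""    # every byte added so far: fulltexts plus delta insertions
    recipes = []   # per-text instructions for rebuilding it from 'group'
    for text in texts:
        if not group:
            recipes.append([("fulltext", 0, len(text))])
            group += text
            continue
        recipe = []
        for tag, i1, i2, j1, j2 in SequenceMatcher(None, group, text).get_opcodes():
            if tag == "equal":
                recipe.append(("copy", i1, i2 - i1))       # reuse existing bytes
            elif tag in ("replace", "insert"):
                new_bytes = text[j1:j2]
                recipe.append(("insert", len(group), len(new_bytes)))
                group += new_bytes                          # becomes a source for
        recipes.append(recipe)                              # the texts that follow
    return recipes, zlib.compress(group)

Inserting the newest text as the fulltext is what makes the most recent version the cheapest one to extract, flipping the problem described in point 1 above.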

As an example, say you have 3 texts.
text1:
first line
second line
third line

text2:
first line
modified second line
third line

text3:
first line
remodified second line
third line

So if you insert text3 at the start (as a fulltext), then when you insert text2 its delta can copy the bytes for "first line" and "third line" from text3, and only has to insert "modified second line" as new bytes. When you get to text1, you can again copy the bytes for "first line" and "third line" from text3, inserting just "second line".

There are a few ways to look at this. For example, one can consider that the recipe for extracting text1 is approximately the same as if you used a simple delta for text3 => text2, and then another delta for text2 => text1. The primary difference is that the recipe has already combined the two deltas together. The main benefit is that to extract text1, you don't have to create the intermediate text2.
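Written out for the example above (treating whole lines as the unit of copying, with illustrative offsets into the group's byte stream; this is not the real wire format), the stored recipes might look roughly like this:

# Illustrative only: the group's byte stream and the per-text recipes.
group_bytes = (
    b"first line\nremodified second line\nthird line\n"  # text3, the fulltext (offsets 0-44)
    b"modified second line\n"                            # inserted by text2's delta (offset 45)
    b"second line\n"                                     # inserted by text1's delta (offset 66)
)

recipes = {
    "text3": [("fulltext", 0, 45)],
    "text2": [("copy", 0, 11),     # "first line\n"  from text3
              ("insert", 45, 21),  # "modified second line\n"
              ("copy", 34, 11)],   # "third line\n"  from text3
    "text1": [("copy", 0, 11),     # "first line\n"  from text3
              ("insert", 66, 12),  # "second line\n"
              ("copy", 34, 11)],   # "third line\n"  from text3
}
# text1's copies point directly at text3's bytes in the group, so text2
# never has to be reconstructed along the way.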

One downside to storing the expanded recipes is that there is some redundancy. Consider that both text2 and text1 copy "first line" from text3. In short examples this isn't a big deal, but if you have 100s of texts in a row, the final recipe will look very similar to the previous one, and both will be full of copy instructions pointing at a lot of different regions. (Development tends to add lines, so storing things in reverse order means those lines look like deletions. Removing lines splits a copy command into 2 copy commands, for the lines before and the lines after.)

A lot of that redundancy is removed by the zlib pass. By doing the delta compression first, you can still get good efficiency out of zlib's 32kB window. The other thing we do is analyze the complexity of the recipe. If the recipe starts becoming too involved, we go ahead and insert a new fulltext, which then becomes a source for all the texts that follow. There are lots of bits that we can tune here. The most important part right now is making sure that the storage is flexible enough to allow us to change the compressor in the future without breaking old clients.
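As a hedged illustration of those last two points (the threshold below is invented for the sketch and is not the heuristic bzr actually uses), the fulltext fallback and the final pass might look something like:

import zlib

MAX_INSTRUCTIONS_PER_KB = 16   # made-up threshold, purely for the sketch

def should_store_fulltext(recipe, text_len):
    """If a recipe fragments into too many copy/insert instructions for the
    size of the text, a fresh fulltext is cheaper to extract and becomes a
    better copy source for the texts that follow."""
    budget = max(1, text_len // 1024) * MAX_INSTRUCTIONS_PER_KB
    return len(recipe) > budget

def finish_group(group_bytes):
    """Final pass: zlib squeezes out the remaining redundancy between recipes."""
    return zlib.compress(group_bytes, 9)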


What now?

The branch where the work is being integrated is available from:
bzr branch lp:~bzr/bzr/brisbane-core
cd brisbane-core
make

For now, the new repository format is available as "bzr init-repo --format=gc-chk255-big", but it is considered "alpha", meaning it passes tests, but we reserve the right to change the disk format at will. Our goal is to get the format to "beta" within a month or so, at which point it will land in bzr.dev (and thus the next release) as a "--development" format (also available in the nightly ppa). At that point, we won't guarantee that the format will be supported a year from now, but we do guarantee to support converting data out of that format. (So if we release --development5, there will be an upgrade path to --development6 if we need to bump the disk format.)

Going further, we expect to make it our default format by around June 2009.

At this point the changes are mostly polish and ensuring that all standard use cases do not regress in performance. (Operations that are O(N) are often slower in a split layout, because you spend time bringing in each page. But if you can change them into O(logN), the higher constants don't matter anymore.)