John Arbash Meinel's Bazaar Blog: 2007

Wednesday, October 17, 2007

Bazaar vs Subversion

Every so often someone comes along wanting to know which VCS they should use. I won't claim to be an impartial observer, but this is a list of things I put together for the last discussion, that I thought I would share here.

SVN requires all commits to go to a central location, and tends to favor having multiple people working on the same branch.

This is both a positive and a negative depending on what you are trying to do.

When you have a bunch of developers that don't know a lot about VCS, it simplifies things for them. They don't have to worry about branches, they just do their work and check it in.
The disadvantage is that they can tread on each other's toes (by committing a change
that breaks someone else's work), and their work immediately gets
mixed together and can't be integrated separately.

Bazaar has chosen to address this with workflows. You can explicitly have a branch set up to send all commits to a central location (bzr checkout), just as you do with SVN. Also, if two people checkout the same branch, they must stay in sync. (Bazaar actually has a stronger restriction here than SVN does, because SVN only complains if they modify the same files, whereas Bazaar requires that the whole tree be up to date.)

However, with a Bazaar checkout, there is always the possibility to either bzr unbind or just bzr commit --local when you are on a plane, or just want to record in-progress work before integrating it into the master branch.

SVN has a lot more 3rd party support.

SVN has just been around longer, and is pretty much the dominant open source centralized VCS. There are a lot of DVCSes at the moment, all doing things a little bit differently. Competition is good, but it makes it a bit more difficult to pick one over the other, and 3rd party tools aren't going to build for everyone.

However, Bazaar already has several good third party tools. For viewing changes to a single file, bzr gannotate can show when each line was modified, and what the associated commit message was. It even allows drilling back in history to prior versions of the file.
For viewing the branch history (showing all the merged branches, etc) there is bzr viz.
There are both gtk and qt GUIs, a Patch Queue Manager (PQM) for managing an
integration branch (where the test suite always must pass or the patch is
rejected.)
There is even basic Windows Shell integration (TortoiseBzr), a Visual Studio plugin, and an Eclipse plugin.

Bazaar is generally much easier to set up.

SVN can only really be set up by an administrator, someone who has a bit more of an idea what they are doing. Setting up WebDAV over http is easier than it used to be, but it isn't something you would ask just anyone to do. Getting a project started with Bazaar is usually as simple as bzr init; bzr add; bzr commit -m "initial import".

You can push and pull over simple transports (ftp, sftp, http).

Because SVN is centralized, you only really set it up one time anyway, so as long as you have one competent person on your team, you can probably get started.

It is easier to get 3rd party contributions.

If you give a user commit access to your SVN repository, then you have their changes available whenever they commit. But usually this also means that they have access to change things that you don't really want them to touch. (Yes, there are ACLs that you can set up, but I don't know many projects that go to that trouble for casual contributors.)

If you haven't given them commit access, then they have to work on their own, and the VCS doesn't give you a direct way to collaborate with them. You are back to using something like diff+patch.

Because Bazaar supports intelligent merging between "repositories", integrating other people's work is usually a bzr merge away. SVN 1.5 is supposed to address the merge issue, but at best it helps within a repository. So if someone is developing stuff on their own side, you are still stuck with diff+patch.

Just to reiterate, Bazaar can make it much easier to get "drive-by" contributions from users, which can be a good stepping stone towards growing your development community.

Subversion's model is a giant versioned filesystem. Bazaar uses a concept of a Tree.

I have little doubt that this made tracking merging more difficult in SVN, since there isn't a clear 'top' that has been merged with the other 'top'.

It also means that SVN commits aren't atomic in the same way that Bazaar commits are. In Bazaar, when you commit, you are guaranteed to be able to get back to that same revision. With SVN, if people are working on different files, both can commit, and when you checkout the final tree, it will not match either side.
This has some implications for assuring that the test suite passes on a given branch,
since the test suite can pass on my machine, and on their machine, but after we both commit, it won't pass after doing a checkout.
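
To make the difference concrete, here is a toy model in Python of the two "are you up to date?" checks described above. It is only a sketch under my own assumptions: the `Server` class and both `can_commit_*` functions are hypothetical names, and neither tool is implemented this way internally.

```python
class Server:
    """Toy central branch: a revision counter plus the revision that last touched each file."""

    def __init__(self):
        self.head = 0
        self.last_changed = {}  # path -> revision number that last modified it

    def commit(self, paths):
        self.head += 1
        for path in paths:
            self.last_changed[path] = self.head
        return self.head


def can_commit_per_file(server, base_rev, paths):
    """SVN-like rule: only the files being committed must be up to date."""
    return all(server.last_changed.get(p, 0) <= base_rev for p in paths)


def can_commit_whole_tree(server, base_rev, paths):
    """Bazaar-checkout-like rule: the whole tree must be at the branch tip."""
    return server.head == base_rev


server = Server()
base = server.commit(["a.py", "b.py"])   # both developers start from revision 1
server.commit(["a.py"])                  # developer one commits a change to a.py

# Developer two, still at revision 1, now wants to commit b.py.
print(can_commit_per_file(server, base, ["b.py"]))    # True: b.py itself is current
print(can_commit_whole_tree(server, base, ["b.py"]))  # False: must update first
```

Under the per-file rule, the revision that results from the second commit is a tree that neither developer ever had on disk, which is exactly the case where the test suite can pass on both machines and still fail after a fresh checkout.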

SVN supports partial checkouts better than Bazaar does.

This is mostly a consequence of the above point, rather than an explicit feature. But because SVN doesn't label anything as a special Tree, you can check out project/doc just as easily as project.

We are looking into ways to at least fake this with Bazaar (we secretly check out the whole tree, but hide the bits that you don't care about), because we are aware of use cases where it is important. (A documentation team that doesn't want or need to see all the code, etc.)

SVN stores history on the server.

In the standard workflows, Bazaar has you copy the full project history to your local machine. For most projects, this isn't a big deal, because the delta compressed history is only a small multiple of a checked out tree. (Plus SVN always checks out 2 copies anyway.)
But there are times when people abuse the VCS, and check in a CD ISO (which gets deleted shortly thereafter). Suddenly you have more garbage data in your repository than you have desirable data.

Bazaar does have support for "lightweight checkouts", which are SVN style working directories where all the history is on the server, and only the working tree is local. Of course if you do this, you lose some flexibility (offline commits), but you get to choose when that fits your needs.

We also have "shared repositories" which can be used to share storage between branches. So even though you have 10 branches, you only have 1 copy of the history.

We are working on having a Shallow Branch/History Horizon which should be a very good compromise between the two. The basic idea is that it can pull down data that you are using, without needing the full history.

Storage of Binary Files

At the moment SVN's delta algorithm for binary files is able to give smaller deltas than ours does. This is likely to change in coming releases, but at the moment there will be times when SVN requires less disk space for binary files that you modify often. For binary files that change infrequently, or for compressed ones, there is likely to be less of a difference. (Most compressed formats don't delta well because a small change causes ripples in the compressed stream.)

Handling large files

At the moment, Bazaar has the expectation that you can fit a small number of copies of the contents of any file in memory. (The merge algorithm needs a BASE, THIS, and OTHER copy.)
So when you need to version 1GB movies, etc, SVN is probably a better choice at the moment. You might also consider whether a VCS is actually the right way to handle those files.
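
For anyone wondering why three copies are needed, here is a minimal sketch of the three-way rule. It is not Bazaar's merge code: `merge3` and `merge_lines` are names I made up, and I assume the three texts have the same number of lines so that no diff alignment is needed (a real merge has to line the texts up first).

```python
def merge3(base, this, other):
    """Return (merged_value, conflict) for one item using the three-way rule."""
    if this == other:
        return this, False      # both sides agree (or neither changed)
    if this == base:
        return other, False     # only OTHER changed
    if other == base:
        return this, False      # only THIS changed
    return this, True           # both changed differently: conflict, keep THIS for now


def merge_lines(base, this, other):
    """Merge three equal-length lists of lines, position by position."""
    merged, conflicts = [], []
    for i, (b, t, o) in enumerate(zip(base, this, other)):
        value, conflict = merge3(b, t, o)
        merged.append(value)
        if conflict:
            conflicts.append(i)
    return merged, conflicts


base = ["one", "two", "three"]
this = ["ONE", "two", "three"]
other = ["one", "two", "THREE"]
print(merge_lines(base, this, other))  # (['ONE', 'two', 'THREE'], [])
```

Every decision looks at BASE, THIS, and OTHER together, which is why all three texts end up in memory at once.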

We are certainly considering changing some parts of our code to be able to only read parts of files. But it is lower on our list of priorities.

Building up a project out of subprojects

At the moment SVN's externals handle more use cases than we do.
We are working on more complete support with Nested Trees. The internal data structures are present, but not all of the push/pull/merge/etc commands have been updated.

We already have good support for merging in a project into another project, so you get 1 large tree. And then you can continue to merge new changes from upstream, and it will apply to the correct files. However, once you have aggregated a project, it is harder to send any of your own changes upstream, independent of all the other files. (It is possible to do so, but it requires you to cherry pick the changes, and track when you modify which files.)

Also, Nested Trees are designed to allow you to easily checkout an exact copy of the full project at the exact revision of every sub-project, while still allowing you to 'bzr update' them to the current version of all the sub projects.

Clarity of "log"

One major difficulty with CVS is just figuring out what has been changing. With
Bazaar, you can do a simple bzr log and it shows you what has been
changing for the whole branch. SVN has a similar svn log which shows
you what has been changing underneath the current directory. (So they are
approximately the same, if you are in the root of an SVN branch.)

However, if you use feature branches to develop, and then have an integration
branch (trunk), with Bazaar you can do bzr log --short which shows only
the mainline revisions. In this case, that would be just the integration summary
messages. So you can see a single "merged feature X" message, rather than the
50 small commit messages that build up into that feature.
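
A rough sketch of the idea, assuming "mainline" means following only the first (left-hand) parent of each revision. The revision ids and the `parents` table below are invented for illustration.

```python
# Hypothetical history: revision id -> list of parent ids, first parent first.
parents = {
    "trunk-1": [],
    "feat-1": ["trunk-1"],
    "feat-2": ["feat-1"],
    "trunk-2": ["trunk-1", "feat-2"],   # the "merged feature X" commit
    "trunk-3": ["trunk-2"],
}


def mainline(tip, parents):
    """Follow only the first parent from the tip back to the origin."""
    revs = []
    rev = tip
    while rev is not None:
        revs.append(rev)
        ps = parents[rev]
        rev = ps[0] if ps else None
    return revs


print(mainline("trunk-3", parents))  # ['trunk-3', 'trunk-2', 'trunk-1']
```

The feature branch commits (feat-1, feat-2) are still in the history, they are just not on the path that the short log walks.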

Plugin Architecture

One of Bazaar's main strengths is the ability for third party developers to add
commands or customization through the use of plugins. Plugins can provide
simple extensions (a different log format to conform to a company's particular
style expectations), new commands (history introspection, extra patch management,
integration with the PQM), or even support for a different repository format (at the
moment bzr-svn provides a way to treat an SVN repository as just another Bazaar branch, allowing you to push, pull and merge.)

While not every user is going to want to write a plugin, it does provide ways
for administrators to customize the behavior of Bazaar, so that the tool can be
slimmed down to provide just the basics, or expanded to provide specific
workflows customized to the situation.

Rename support

This is another place where SVN is much better than CVS, but Bazaar is even better still.
SVN has support for the basic concept of renaming, though it is implemented as a copy+delete pair. "copy" allows 2 files to have the same history prior to the point of copying, which means commands like svn log and svn annotate use the full history of the file. But there is more that can be done.

One of the reasons projects hesitate to rename files is that it then becomes difficult to accept changes from elsewhere. Suddenly the change has nowhere to go, because the target file is not there anymore. And this is where Bazaar has a distinct advantage over SVN. When you rename a file, Bazaar knows that any patches to that file belong in the new destination, which means that when you need to refactor your code to clean up the overall structure, you can still merge changes that were created before the restructuring. I know I didn't realize how differently I worked with my code before I had the ability to fix simple name errors. (This file is 'Bars.c' when it should just be 'bar.c', etc.)
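
Here is a toy illustration of why that works, assuming each file is tracked by a stable identifier rather than by its path. The `Tree` class and the id "id-1" are hypothetical, and this is not how the real data structures look.

```python
class Tree:
    """Toy tree that maps a stable file id to its current path and content."""

    def __init__(self):
        self.paths = {}     # file_id -> path
        self.content = {}   # file_id -> text

    def add(self, file_id, path, text):
        self.paths[file_id] = path
        self.content[file_id] = text

    def rename(self, file_id, new_path):
        self.paths[file_id] = new_path

    def apply_change(self, file_id, new_text):
        """Apply a change addressed by file id, wherever the file lives now."""
        self.content[file_id] = new_text
        return self.paths[file_id]


mine = Tree()
mine.add("id-1", "Bars.c", "int bar;\n")
mine.rename("id-1", "bar.c")            # my cleanup

# A contributor's change was made against the old name, but carries the id.
print(mine.apply_change("id-1", "int bar = 0;\n"))  # bar.c
```

If the change were addressed by path instead, it would be looking for 'Bars.c' and find nothing.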

In summary, SVN may be a better choice if you have large binary files, projects with subprojects, need partial checkout support or more mature integration with 3rd party tools than Bazaar currently has. OTOH, if workflow flexibility is important, collaborating with others and increasing community participation matter, low administration is appealing or you care about quality branching/merging and correct rename handling, then Bazaar can help make life more enjoyable and ought to be seriously considered either now or in the future, depending on how comfortable you are with its maturity.

Monday, May 21, 2007

Ogg Vorbis and iTunes

I've been a longtime supporter of Ogg Vorbis, and I'm also a Mac user. While I haven't figured out how to get my iPod to play Ogg just yet, I have worked on getting iTunes to play it. I periodically do searches to see if things have improved, but they seem to return mostly old data.

So I just wanted to get it out that the good people at Xiph have started maintaining the Ogg Vorbis plugin. It is available here.

I don't seem to be able to find the page again, but I thought I read there were some small problems with the last release. They have development snapshots here. At least so far, I haven't run into any problems with it. And overall it seems to consume fewer CPU resources than the older releases.

Tuesday, April 3, 2007

New Launchpad UI

Just a quick mention of a new update to Launchpad.
The user interface has gone through a bit of an overhaul. Pages have been streamlined and unified by operation (overview, code, bugs, etc).

Some things are just glitter, but there is also some real meat to the update.

One piece of glitter is that the individual pages can have some custom branding of icons. For example, the bzr or jokosher pages have custom icons, which stay with you when you go to the 'bugs' page, or the 'code' page. It is a small thing, but I think it does help you keep track of where you are.

One more bit of useful bling is in the bug listing. If you see the Bazaar "merge" icon next to a bug, it means there is a Bazaar branch associated with it. So you can see what bugs someone is actively working on. And it gives people an easy way to get the potential bugfix, or possibly just subscribe to the bug (and/or branch changes) to follow along with how it is being fixed.

This tidbit is also shown in reverse on the branch listing. So when you are looking at the list of branches, you can see ones that are associated with bugs, and quickly jump to the bugs that they are addressing.

Overall, I'm happy to see the new design, and I think the Launchpad developers deserve many kudos!

Friday, March 30, 2007

Java and Python

One thing that we are asked from time to time is if there is an Eclipse plugin for bzr. At the moment, there is a project which has been started: bzr-eclipse.

It is still in the very early stages, but it seems there is enough interest, so I figured I would explore the space a bit.

One issue is trying to figure out how to communicate between bzr (written in Python) and Eclipse (written in Java).

One obvious method is to just write Java code which calls out to bzr, the command line program, and then parses the string output from stdout and stderr. This can work, but bzr isn't especially scriptable. It can be scripted, but it is more focused on being something that is nice to use for a human than something that is easy to parse for a machine.

We have a much richer machine API in bzrlib, the Python library which is the guts of bzr. Wouldn't it be nice if we could get direct access to this rich API?

Well, there are two projects that I know of: Jython, and one I just heard about, JEPP (Java Embedded Python).

Jython has the goal of running Python code directly on the Java Virtual Machine. I'm not sure of everything that this entails, but my understanding is that it is basically writing a compiler that turns Python code into Java bytecode. I have high hopes for this project, but at the moment it only supports Python 2.3 syntax (if you use the current beta). Unfortunately bzr is written with Python 2.4 syntax in mind. (We use decorators a lot, and some generator expressions.)
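
For reference, these are the two pieces of 2.4 syntax in question. The `logged` decorator and `square` function are just made-up examples, not code from bzr.

```python
def logged(func):
    """Decorator that counts calls; the @ syntax arrived in Python 2.4."""
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@logged
def square(n):
    return n * n


# A generator expression (also new in 2.4): no intermediate list is built.
total = sum(square(n) for n in range(4))
print(total, square.calls)  # 14 4
```

A parser that only understands 2.3 syntax fails on both the `@logged` line and the bare `square(n) for n in ...` inside the call.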

The other (major?) limitation is that Jython doesn't have a good way to support "os.chdir()". And while our general code doesn't actually use it, our test suite makes heavy use of "os.chdir()" to make sure that each test runs in isolation. Other limitations include not having a complete Python standard library. Again, we use subprocess in the test suite when we want to ensure a clean run of bzr. We also use logging. There is also some concern about C extensions. At the moment, bzr is written in 100% Python code, but as we finalize our data structures, we would like to implement any heavy processing loops in C/C++ (or possibly pyrex, which compiles to C).
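
To show the kind of "os.chdir()" use I mean, here is a small sketch of a test that isolates itself in a scratch directory. `TestInTempDir` is a hypothetical class written for this post, not the actual base class from the bzr test suite.

```python
import os
import shutil
import tempfile
import unittest


class TestInTempDir(unittest.TestCase):
    """Each test runs inside its own scratch directory."""

    def setUp(self):
        self._orig_dir = os.getcwd()
        self._test_dir = tempfile.mkdtemp()
        os.chdir(self._test_dir)

    def tearDown(self):
        # Go back first, so the scratch directory can be removed.
        os.chdir(self._orig_dir)
        shutil.rmtree(self._test_dir)

    def test_starts_empty(self):
        self.assertEqual([], os.listdir("."))
        with open("scratch.txt", "w") as f:
            f.write("data")
        self.assertEqual(["scratch.txt"], os.listdir("."))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestInTempDir)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Without a working chdir, every relative path in a test like this would have to be rewritten to carry the scratch directory around explicitly.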

But we could probably work around most of the missing functionality. The biggest thing is just Python 2.4 compatibility.

But this week I was exposed to Java Embedded Python, or JEPP, which takes the other approach. Rather than implementing the Python language in Java, just embed a CPython interpreter in a Java process.

This means you can use whatever CPython you have available on your system (2.3, 2.4, 2.5?). And you are sure to have access to the full standard library, extensions should never be a problem, etc.

The only real limitation of this approach is figuring out how well you can expose the embedded CPython interpreter. At a basic level, it isn't much different than calling 'python -c "do something"'. But it is possible to create a richer interaction between the CPython interpreter and the JVM, which is what JEPP is trying to do.

I played with JEPP today, and I think it is a really good start. It isn't functional enough yet that I would use it for a large project. But it seems almost there. At the moment it is able to return integers, floats, longs, and strings. But it isn't able to pass back and forth Python objects.

It does let you do stuff like:

Jep jep = new Jep(false, ".");
jep.runScript("a_python_script.py");

And the script can have quite a bit of logic. The script is run as '__main__', and the variables, functions, etc are in the running namespace. So you can do stuff like:

Object value = jep.getValue("variable");

or

Object ret = jep.invoke("a_function", "param1", 2, 3);

If "a_function" returns a "basic" type (int, long, float, str), then the returned Java Object is an Integer, Float, String, etc.

The only thing that doesn't work well is when the returned object is not a basic type. The code falls back to the catch-all, which converts everything to a string. I don't think this is the long term plan for the project, because they have a "PyObject" Java class.
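
For completeness, here is what the Python side of those calls might look like. The contents of "a_python_script.py" are entirely hypothetical; only the names "variable" and "a_function" come from the Java snippets above, and `not_basic` is an extra I added to show the fallback case.

```python
# a_python_script.py (hypothetical contents)

# Module-level names end up in the namespace the Java side can query.
variable = 42


def a_function(name, x, y):
    """Return a basic type (str) so it can map onto a Java String."""
    return "%s:%d" % (name, x + y)


def not_basic():
    """Returns a dict, which is not a basic type, so per the description
    above it would reach the Java side only as its string form."""
    return {"answer": variable}
```

With this script, the invoke call shown earlier would be asking for `a_function("param1", 2, 3)`, which on the Python side evaluates to the string "param1:5".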

I would expect the PyObject class to develop functions similar to Boost::Python's boost::python::object class.

I don't know if they will end up exposing as much of the api (slice is a nice convenience function, but logically maybe it shouldn't be on object), but ones like attr would certainly be useful. (As they also give you a way to call member functions, etc).

I know Boost does a lot of work behind the scenes with templates, and Java doesn't have the same functionality. I don't know if Java "Generics" are up to the task of PyObject(function).

Now I just have to figure out how to get commit notifications for a Sourceforge SVN project, so I can watch it evolve. :)

Tuesday, March 27, 2007

Test DRIVEN Development

For the Bazaar project we have a general goal that all code should be tested. We have an explicit development workflow that all new code must have associated tests before it can be merged into mainline.

Our latest release (0.15, rc3 is currently out the door, final should happen next week) introduces a new disk format, and a plethora of new tests. ('bzr selftest' in 0.14 has approx 4400 tests, and 5900 in 0.15.) A lot of these are interface tests. Since we support multiple working tree, branch, and repository formats, we want to make sure that they all work the same way. (So only 1 test is written, but it may be run against 4 or 5 different formats.)

It means that we have a very good feeling that our code is doing what we expect it to (all of the major developers dogfood on the current mainline). However, it comes at a bit of a cost, in that running the full test suite gets slower and slower.

Further, I personally follow more of a 'test-after-development'. And I'm trying to get into the test driven development mindset. I don't know how I feel just yet, but I was reading this. And whether you agree with all of it, it makes it pretty clear how different the mindset can be. It goes through several iterations of testing, coding, and refactoring before it ends up anywhere I consider "realistic". And a lot of that comes at the 'refactoring' step, not at the coding step.

I have a glimpse at how it could be useful, as the idea is to have very small iterations. Such that it can be done in the 3-5 minute range. And every 3-5 minutes you should have a new test which passes. It means that you frequently have hard-coded defaults, since that is all the tests require at this point. But it might also help you design an interface, without worrying about actually implementing everything.

He also makes comments about keeping a TODO list, which was the part that made the most sense to me. Because you can't ever write all the code fast enough to get all the ideas out of your head. So you keep a TODO so you don't forget, and also so you don't feel like you need to track down that path right now.

The other points that stuck with me are that most tests should be "unit tests". Which by his definition means they are memory-only and very narrow in scope. And that the test suite should be fast to run, because once it gets under a threshold (his comment was around 10 seconds, not minutes) then you can actually run all of them, all the time.

And since a development 'chunk' is supposed to be 3-5 minutes, it is pretty important that the test suite only take seconds to run. The 10s mark is actually reasonable, because it is about as long as you would be willing to give to that single task. Any longer and you are going to be context switching (email, more code, IRC, whatever).

The next level of test-size that he mentions is an "integration" test. I personally prefer the term "functional" test. But the idea is that a "unit" test should be testing the object (unit) under focus, and nothing else. Versus a functional test that might make use of other objects, and disk, database, whatever. And then the top level is doing an "end-to-end" test. Where you do the whole setup, test, and tear down. And these have purpose (like for conformance testing, or use case testing), but they really shouldn't be the bulk of your tests. If there is a problem here, it means your lower level tests are incomplete. They are good from a "the customer wants to be able to do 'X', this test shows that we do it the way they want" viewpoint.

I think I would like to try real TDD sometime, just to get the experience of it. I'll probably try it out on my next plugin, or some other small script I write. I have glimpses of how these sorts of things could be great. Often I'm not sure how to proceed while developing because the idea hasn't solidified in my head. One possibility here is "don't worry about it", create a test for what you think you want, stub out what you have to, and get something working.

Of course, the more I read, the more questions spring up. For example, there is a lot of discussion about test frameworks. Python comes with 'unittest', which is based on the general JUnit (or is it SUnit) framework. Where you subclass from a TestCase base class, and have a setUp(), and tearDown(), and a bunch of test_foo() tests.
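
For anyone who hasn't seen it, the shape looks like this. `TestStack` is just a made-up example that tests a plain Python list.

```python
import unittest


class TestStack(unittest.TestCase):

    def setUp(self):
        # Runs before every test method.
        self.stack = [1, 2]

    def tearDown(self):
        # Runs after every test method, whether it passed or failed.
        del self.stack

    def test_push(self):
        self.stack.append(3)
        self.assertEqual([1, 2, 3], self.stack)

    def test_pop(self):
        self.assertEqual(2, self.stack.pop())
        self.assertEqual([1], self.stack)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestStack)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test gets a fresh `self.stack`, so the push in one test never leaks into the other.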

But there is also nose and py.test, which both try to overcome unittest's limitations. And through reading about them, there is a discussion that Python 3000 will actually have a slightly different default testing library. (For sundry technical and political reasons.)

And then there is the mock versus stub debate. As near as I can tell, it seems to fall around how to create a unit test when the object under test depends on another object. And which method is more robust, easier to maintain, and easier to understand. That link lends some interesting thought about Mock objects. That instead of testing the state of objects, you are actually making an explicit assertion that the object being tested will make specific calls on the dependency.
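
A small sketch of the two styles side by side, as I understand them. It uses `unittest.mock` from the standard library of current Python (so not something available when this was written), and `notify` and `StubMailer` are hypothetical names.

```python
from unittest import mock


def notify(mailer, user):
    """Code under test: send one greeting through the dependency."""
    return mailer.send(user, "hello")


class StubMailer:
    """Stub: canned behaviour, and we inspect its state afterwards."""
    def __init__(self):
        self.sent = []
    def send(self, to, body):
        self.sent.append((to, body))
        return True


# State-based check with a stub: look at what ended up recorded.
stub = StubMailer()
notify(stub, "joe")
assert stub.sent == [("joe", "hello")]

# Interaction-based check with a mock: assert the exact call that was made.
mailer = mock.Mock()
notify(mailer, "joe")
mailer.send.assert_called_once_with("joe", "hello")
```

The second check fails if `notify` is later changed to call, say, a `send_many` method, even when the mail still goes out. That is the "testing an exact implementation" feeling.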

I'm not settled on my decision, there. Because it feels like you are testing an exact implementation, rather than testing the side effect (interface). Some of what I read says "yes, that is what you are doing, and that is the point." I can understand testing side-effects. I guess part of it is how comfortable are you with having your test suite evolve. At least some tests need to be there to say that the interface hasn't changed since the previous release. (Or that a bug hasn't been reintroduced). If that edge case was tested by a particular test, and that test gets refactored, do you have confidence you didn't re-introduce the bug?

I guess you could have specific conventions about what tests are testing the current implementation, versus the overall interface of a function or class. I can understand that you want your test suite to evolve and stay maintainable. But at the other end, it is meant to test that things are conforming to some interface, so if you change the test suite, you are potentially breaking what you meant to maintain.

Maybe it just means you need several tiers of tests, each one less likely to be refactored.

Wednesday, March 14, 2007

Reading and Writing to Files ('r+', 'w+' mode) on Windows

It turns out that Windows has a small oddity when reading and writing to a file. It is reported in the 'fopen' documentation at MSDN:
http://msdn2.microsoft.com/en-us/library/yeby3zcb(vs.71).aspx

The specific quote is:

When the "r+", "w+", or "a+" access type is specified, both reading and writing are allowed (the file is said to be open for "update"). However, when you switch between reading and writing, there must be an intervening fflush, fsetpos, fseek, or rewind operation. The current position can be specified for the fsetpos or fseek operation, if desired.

As an example, here is what you might do in python:

>>> f = open('test', 'wb+')
>>> f.write('initial text\n')
>>> f.close()
>>> f = open('test', 'rb+')
>>> f.read()
'initial text\n'
>>> f.write('this should go at the end\n')

On most platforms, that succeeds. But on Windows, if you don't do

>>> f.seek(0, 2) # Seek to the end of the file

before you call f.write(), you will get an IOError, with e.errno = 0. (Yeah, having an error of SUCCESS is a little hard to figure out).
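
Put together, the portable pattern looks something like the sketch below. It is written in Python 3 spelling (bytes literals, `with` blocks) and uses a temporary directory I picked for the example. Newer Python versions have their own I/O layer and may not need the seek at all, so treat it as a harmless habit rather than a guaranteed requirement.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test")

with open(path, "wb+") as f:
    f.write(b"initial text\n")

with open(path, "rb+") as f:
    data = f.read()
    # Reposition between the read and the write; seeking to the end
    # (offset 0 relative to os.SEEK_END) keeps this an append.
    f.seek(0, os.SEEK_END)
    f.write(b"this should go at the end\n")

with open(path, "rb") as f:
    final = f.read()
print(final)
```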

Anyway, it took a while for me to figure out, so I figured I'd let other people know.

11 Steps to creating a new Launchpad Project

I frequently create new projects in launchpad, as I generally have a new "product" for every plugin that I write. I figured I would write down the specific steps I use, because there are a few non-obvious links that you need to use.

0) One quick point of terminology. Launchpad has the idea of "projects" and "products". A "project" is a collection of "products". For example we have the Bazaar project, which includes the "bzr" program, as well as plugins for "bzr". It is a little foreign to me, since I consider what I work on a "project" rather than a "product". But I understand the need for a higher level grouping, and I can't say that I have better names to distinguish them.

Also, each product gets a set of "series". These are generally meant along the lines of "release series". Most projects will have a development series (by default this is called "trunk"), and possibly some release series. Especially large projects, which will have concurrent development (think of Firefox, which has a 2.0 series, and a 1.5 series, since you get 1.5.1 and 2.0.1).

1) Go to Launchpad itself: https://launchpad.net

2) Go to the products page https://launchpad.net/products
The link on the main page is "register your product".

3) If this is an existing project, you probably want to search and make sure it isn't already registered in Launchpad. In my case, these are always new projects, so I don't worry about it.

4) Follow the "Register a Product" link on the upper left (https://launchpad.net/products/+new).

5) Fill out the basic information for the product. In my case, most of my products fall under the "bazaar" project banner. When creating a new plugin for "bzr", the general convention is to call it "bzr-plugin-name". It certainly isn't required, it is just a convention that I've tried to follow.

6) Change the product to use Malone (Launchpad's bug tracker) as the official bug tracker. This is the link "Define Launchpad Usage" on the left. (https://launchpad.net/PRODUCT/+launchpad). You may also enable Rosetta translations at this time.

7) Change the Maintainer of the product to a shared group. I usually want other people to be able to update the details of the product, update the bug tracker, etc. So I set the project as "owned" by the "bzr" group. That is done by following the "Change Maintainer" link (https://launchpad.net/PRODUCT/+reassign).

8) Now you want to create a Bazaar branch for the mainline of the project. You can do this through the "Register Branch" links. I personally tend to host my branches on Launchpad itself (hosting is free, and it is bandwidth I don't need to pay for). So I do a simple:


cd $local_branch
bzr push sftp://user@bazaar.launchpad.net/~bzr/PRODUCT/trunk

A bit of explanation: "user" must be your Launchpad user name, and "~bzr" can be either your username, or the name of the group in step 7. As I mentioned, I prefer the mainline to be a shared branch, so other people can update the mainline if I'm too busy, or cannot be contacted for some reason.

9) Now update the "trunk" series to point to this new branch. There should be a link on the main page (https://launchpad.net/PRODUCT) to the "trunk" series. Or you can link more directly to it at (https://launchpad.net/PRODUCT/trunk).
You want to "Change Series Details" for this series (https://launchpad.net/PRODUCT/trunk/+edit).

10) At this point, you can change the name of the series (maybe you prefer "mainline" over "trunk"). You also can change the description. I usually leave them alone. What I do change is the "Branch". I generally follow the "Choose" link, which lets me search through all branches registered for this product. (Note, pushing to sftp://bazaar.launchpad.net/~USER/PRODUCT/BRANCH-NAME will automatically register the branch.)

11) And you're done. It took a little while, but now you have a fully functioning bug tracker and branch tracker. You are also able to tell people to get your product with:

 bzr branch lp:PRODUCT PRODUCT

And they will get the latest development version.

By registering your branches, you now have the ability to link them with bugs, so that users who find a bug, can see that there is already a fix, even if it hasn't been included in mainline yet.

(Edited to fix "Product" versus "Project")

Friday, March 2, 2007

Dirstate: another 2x performance boost

Another round of performance optimizations in the dirstate code brings us from 15s down to 8s to do a complete 'bzr status' in a tree with 50,000 files and 5,000 directories. (bzr 0.14 takes approx 30s on the same tree.)

There were a few tricks and a few cleanups.
1) Make one pass over the filesystem, rather than 2. We were making a second pass to check for unknown files rather than determining that in the first pass. It doesn't take as long as the first pass, since things are usually cached, but it is work that doesn't need to be done.

2) Work in raw filesystem paths when possible.

Internally in bzr we try to work in Unicode strings as much as possible. It makes things consistent across platforms (I can check in a file called جوجو.txt on windows, and have it show up with the correct filename on Mac and Linux). In fact, on Windows you need to use the Unicode api if you want to get the correct filenames. (They have an OEM name and a Unicode name, but if the characters are not in your codepage you get ????.txt in OEM mode).

However, on Linux, if you want to use Unicode filenames, Python has to decode every name that it finds (the difference between os.listdir('.') and os.listdir(u'.')).

With the dirstate refactoring, we now have a layer that can work in utf8 paths to find changes before it goes up to the next layer which can deal in Unicode paths for simplicity.

3) There is still more we can do. We are trying to continue doing this as a series of correctness preserving steps. But I am happy to say that we are getting some very good results after the last few months of effort. I honestly didn't think that the performance benefits would be this great this early.
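
The first two tricks above can be sketched roughly as follows. This is only an illustration written for Python 3, not the dirstate code itself: the function names and the `known_paths` set (standing in for whatever the tracking layer already knows about) are made up for the example. It walks the tree once, keeps every path as raw bytes, and only decodes to Unicode when something needs to be shown to the user.

```python
import os


def scan_tree(root, known_paths):
    """Walk `root` once, in bytes, and split files into known and unknown.

    `known_paths` is a hypothetical set of bytes paths relative to `root`;
    it stands in for the list of files the version control layer tracks.
    """
    root = os.fsencode(root)  # bytes in, so os.walk yields bytes names
    known, unknown = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            # Classify in the same pass instead of walking a second time.
            if rel in known_paths:
                known.append(rel)
            else:
                unknown.append(rel)
    return sorted(known), sorted(unknown)


def display(paths):
    """Decode to Unicode only at the last moment, for presentation."""
    return [os.fsdecode(p) for p in paths]
```

The point of the split is that the comparison loop never pays for decoding; only the (usually short) list of results does.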

Wednesday, February 28, 2007

Dirstate providing large performance benefits in the workingtree

A while back (how many months now?) we designed a new disk layout for managing the working trees. Basically, instead of storing an XML inventory for the current tree, and then another XML inventory for the basis tree, we decided that we could store a custom file format which would keep the information of basis next to the information for current.

The idea was that rather than spending a long time parsing 2 XML files, and then iterating through one, and looking up the basis in the other, we could save a lot of time by parsing through them at the same time.

It allows us to compare entries without having to create full python objects (1.8us versus approx .14us each or somewhere between 10-30x faster).

Further, XML is not particularly fast to parse, especially in python. cElementTree is a very fast XML parser, but it still has to handle a lot of edge cases and decoding of strings, while a custom format can have things ready for use.
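
To make the idea concrete, here is a toy sketch in Python 3. The record layout is invented purely for illustration (path, current fingerprint and basis fingerprint separated by NUL bytes, one record per line) and is not the real dirstate format. What it shows is the general shape: with current and basis stored side by side, a change check is a comparison of two raw fields, with no per-entry objects built along the way.

```python
def changed_paths(data):
    """Return paths whose current and basis fingerprints differ.

    Hypothetical record layout (not bzr's actual format):
        path NUL current_fingerprint NUL basis_fingerprint NEWLINE
    """
    changed = []
    for line in data.split(b'\n'):
        if not line:
            continue  # skip the empty chunk after the final newline
        path, current, basis = line.split(b'\0')
        # Compare the raw fields directly; no entry objects are created.
        if current != basis:
            changed.append(path)
    return changed
```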

We have a lot more tuning to go, but so far we have been able to improve the "bzr status" time by about 2x. On a large (55k entry) tree, 'bzr status' time dropped from 30s down to 15s on my machine. I think we can get it down to around 5s.

We are still going through the test suite to make sure we have full coverage of all edge cases, and we are ensuring that the new format passes all tests. This is a pretty big effort. Right now we have 4,838 tests in our test suite, and our dirstate branch is up to 5,333 tests. Sometimes having a large test suite is a pain (it takes a while to run, and it can be a little bit more difficult to refactor code). But it makes up for that by letting you know your code is correct when it does pass.

Oh, and the current branch is at:
https://launchpad.net/~bzr/+branch/bzr/dirstate

for those who want to follow along at home.

Thursday, February 22, 2007

Tuple creation speed

Just a small note for a python performance bit that I've found. If you have a list of things, and you want to turn some of them into a tuple, it is faster to do so using

out = (lst[0], lst[1], lst[2])

rather than

out = tuple(lst[0:3])

When I think about it more closely, it makes sense, since the list slicing [0:3] actually has to create a new list and populate it, which you then convert into a tuple.

I don't know if this scales to very large tuples (100 entries), but certainly for tuples up to 5 entries it holds true.

To test it yourself, you can use python's timeit module.


% python -m timeit -s "lst = range(50000)" "x = tuple(lst[10:15])"
1000000 loops, best of 3: 1.12 usec per loop
% python -m timeit -s "lst = range(50000)" "x = (lst[10], lst[11], lst[12], lst[13], lst[14])"
1000000 loops, best of 3: 0.862 usec per loop

That's about 23% less time per loop (or roughly 1.3x faster) for a tuple with 5 entries.
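
The same comparison can also be run from a script. The sketch below is written for Python 3 (where range() has to be wrapped in list() to get a real list), and the function names are just for illustration. Note that wrapping each expression in a function adds call overhead to both sides, so the absolute numbers will not match the command-line runs above; only the relative ordering is of interest.

```python
import timeit

lst = list(range(50000))


def by_slice():
    # Builds a temporary 5-element list, then converts it to a tuple.
    return tuple(lst[10:15])


def by_index():
    # Builds the tuple directly from the indexed items.
    return (lst[10], lst[11], lst[12], lst[13], lst[14])


def compare(number=100000):
    """Time both approaches and return the total seconds for each."""
    return {
        'slice': timeit.timeit(by_slice, number=number),
        'index': timeit.timeit(by_index, number=number),
    }


if __name__ == '__main__':
    for name, seconds in compare().items():
        print(name, seconds)
```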

Monday, February 5, 2007

Internationalized Emailing with Python

It turns out that creating internationalized emails (emails including characters not in US-ASCII) is quite a bit trickier than one would hope. There are a few relevant RFCs:
http://www.faqs.org/rfcs/rfc2822.html
http://www.faqs.org/rfcs/rfc2047.html
http://www.faqs.org/rfcs/rfc2231.html

They all deal with how you encode the content, as well as the headers, mostly because everything was assumed to run on only 7-bit compliant systems, so you have to encode the heck out of everything. For the body of the email, you just add a "Content-Type: ... charset=" field, which explains what charset (codepage, encoding, etc) the content is in.

However, that doesn't work for headers, because that would require another header to define the encoding of this header. So instead they decided that "=?utf-8?b?ZsK1?=" was a good way to encode "fµ".

This also wouldn't have been so bad, except they also decided that email addresses could not be escaped in this way, so you must use "=?utf-8?q?Joe_f=C2=B5?= <joe@foo.com>" and not "=?utf-8?b?Sm9lIGbCtSA8am9lQGZvby5jb20+?=". (That is "Joe fµ <joe@foo.com>" encoded as a single string.)
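
For anyone who wants to try the header rules out, here is a small sketch using the standard library's email package as it exists in Python 3 (so not the bzr-email code, and not the 2.4/2.5 API discussed in this post). The helper names are made up for the example, and joe@foo.com is simply the address embedded in the base64 string above. The key point is that only the display name goes through the encoder; the address is left as plain ASCII.

```python
from email.header import Header, decode_header, make_header
from email.utils import formataddr, parseaddr


def encode_subject(text):
    """Return `text` as an RFC 2047 encoded header value (pure ASCII)."""
    return Header(text, 'utf-8').encode()


def encode_address(display_name, address):
    """Encode only the display name; the address itself stays untouched."""
    return formataddr((display_name, address), charset='utf-8')


def decode_value(value):
    """Turn an encoded header value back into a Unicode string."""
    return str(make_header(decode_header(value)))


if __name__ == '__main__':
    sender = encode_address('Joe fµ', 'joe@foo.com')
    print(sender)
    name, address = parseaddr(sender)
    print(decode_value(name), address)
```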

So it turns out that the python standard library provides most of what you need, but you end up needing a bit of work to get it all together. So in the bzr-email plugin (which generates an email whenever you commit your changes) I did some work to make a nicer interface for creating and sending an email. Basically, it just assumes that you are going to use a Unicode string for the username + email address, splits them, and sets up all the right headers. It also handles connecting to an SMTP host, so you end up doing:

conn = SMTPConnection(config)
conn.send_text_email(u'Joe fµ ',
                     [u'Måry '],
                     u'Subject Hellø',
                     u'Body text\n')

This is especially important because Bazaar supports fully Unicode user names, commit messages, filenames, etc. (Which was tricky enough to get right because of all the complexities of Unicode.)

But I'm happy to say it all seems to be working. Now we just need to figure out how we would change the python standard library "email" library to make it easier for everyone else. (A further complication is that they changed the naming scheme between python2.4 and python2.5. It used to be "email.Utils" and it is now "email.utils". "email.MIMEText.MIMEText" became "email.mime.text.MIMEText".) Overall, I prefer the new layout, but it does mean you need a test import to work around it.
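
The "test import" workaround usually looks something like the sketch below: try the new lowercase names first and fall back to the old ones. The build_message helper is only there to show the imported class being used; it is not part of bzr-email.

```python
try:
    # python2.5 and later: lowercase package layout.
    from email.utils import formataddr
    from email.mime.text import MIMEText
except ImportError:
    # python2.4: the older CamelCase module names.
    from email.Utils import formataddr
    from email.MIMEText import MIMEText


def build_message(body):
    """Create a plain-text message whose body is declared as utf-8."""
    return MIMEText(body, 'plain', 'utf-8')
```

Trying the new names first means the fallback branch is only ever reached on the older interpreter.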

After doing the work, Marius Gedminas showed me where he had also run into the same situation:

http://mg.pov.lt/blog/unicode-emails-in-python.html

Saturday, February 3, 2007

Converting 212,000 revisions in ~12hrs

We have been working with the Mozilla team to see if they can use Bazaar for their development.

We have been having some difficulties scaling up to such a large project (approx 200,000 revisions, 55,000 files.)

However, we have been in the process of tweaking our conversion utilities, as well as the internals of Bazaar to make it faster. We also have been able to exploit one of the semi-controversial features of bzr (file-ids) to allow conversion of pieces of the project, and still have it integrate as a whole.

By breaking up the Mozilla tree into a bunch of sub-modules, we were able to improve our conversion speed to around 4-5 revisions/second, which lets us convert 210,000 revisions in under 14 hours. Woot!! (Before we spent a lot of time on it, a whole-tree conversion was running at about 0.02 revs/s (50s/rev) and was taking almost 2 months to convert.)

To bring these separate trees back into a single tree, I wrote a plugin which simplifies merging projects together (especially when you want to merge one into a subdirectory of the other).
https://launchpad.net/bzr-merge-into

Basically it provides a new command:
bzr merge-into ../other/project subdir
This will merge the 'other/project' tree into the local tree rooted at 'subdir'. After doing this, merging back and forth between other/project and this one is greatly simplified, since bzr records all of the information it needs to know which files should be mapped to which ones.

It even works if you end up moving the files around in the new destination.

Bazaar Structure

I just want to link to a great post by David about the Bazaar working model:
http://ddaa.net/blog/bzr/repository-branch-tree

Some think the model is a little complex. But really it just breaks down into 3 things.

A repository is where your historical information is stored.
A branch is a pointer into this historical information. As new work is done, its history is appended to an existing branch, which gives a view of how the work has evolved.
And a working tree is where all of the real work happens. (This is where you make changes, and actually develop *your* work). We take care of the rest, you deal with the working tree.

By having these as 3 separate concepts, we can build them up into several different (and all useful) arrangements.
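
As a toy illustration of how the three pieces relate (this is a conceptual model only, with invented class and method names, and has nothing to do with the real bzrlib API), one could picture them like this:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Repository:
    """Where the historical information is stored."""
    # revision id -> (parent revision id or None, commit message)
    revisions: dict = field(default_factory=dict)

    def add(self, rev_id, parent, message):
        self.revisions[rev_id] = (parent, message)


@dataclass
class Branch:
    """A pointer into the history held by a repository."""
    repository: Repository
    tip: Optional[str] = None

    def history(self):
        """Follow parent links from the tip back to the first revision."""
        out, rev = [], self.tip
        while rev is not None:
            out.append(rev)
            rev = self.repository.revisions[rev][0]
        return out[::-1]


@dataclass
class WorkingTree:
    """Where the real work happens; committing appends to the branch."""
    branch: Branch
    files: dict = field(default_factory=dict)

    def commit(self, rev_id, message):
        self.branch.repository.add(rev_id, self.branch.tip, message)
        self.branch.tip = rev_id
```

Because the pieces are separate, two Branch objects can point into the same Repository, which is one of the arrangements this separation makes possible.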

New Blog

Well, hello everyone...

I'm going to try starting up this new blog area, hopefully informing the world of the progress we make in our distributed version control system: http://bazaar-vcs.org

I'll try to make informative posts about distributed version control (DVC) in general, and naturally I'll mention specifics of things that happen in our part of that world.

I'm loath to start in the middle, but since we've already come so far, I'm not sure that I can start at the beginning. But here goes...
