Wednesday, February 28, 2007

Dirstate providing large performance benefits in the workingtree

A while back (how many months now?) we designed a new disk layout for managing the working trees. Basically, instead of storing an XML inventory for the current tree, and then another XML inventory for the basis tree, we decided that we could store a custom file format which would keep the information of basis next to the information for current.

The idea was that rather than spending a long time parsing 2 XML files, and then iterating through one, and looking up the basis in the other, we could save a lot of time by parsing through them at the same time.

It allows us to compare entries without having to create full python objects (1.8us versus approx .14us each or somewhere between 10-30x faster).

Further XML is not particularly fast to parse, especially in python. cElementTree is a very fast XML parser, but it still has to handle a lot of edge cases and decoding of strings. While a custom format can have things ready for use.

We have a lot more tuning to go, but so far we have been able to improve the "bzr status" time by about 2x. On a large (55k entry) tree, 'bzr status' time dropped from 30s down to 15s on my machine. I think we can get it down to around 5s.

We are still going through the test suite to make sure we have full coverage of all edge cases, and we are ensuring that the new format passes all tests. This is a pretty big effort. Right now we have 4,838 tests in our test suite, and our dirstate branch is up to 5,333 tests. Sometimes having a large test suite is a pain (it takes a while to run, and it can be a little bit more difficult to refactor code). But it makes up for the fact that you know your code is correct when it does pass.

Oh, and the current branch is at:

for those who want to follow along at home.

Thursday, February 22, 2007

Tuple creation speed

Just a small note for a python performance bit that I've found. If you have a list of things, and you want to turn some of them into a tuple, it is faster to do so using

out = (lst[0], lst[1], lst[2])

rather than

out = tuple(lst[0:3])

When I think closer about it, it makes sense. Since the list slicing [0:3] actually has to create a new list and populate it, which you then convert into a tuple.

I don't know if this scales to very large tuples (100 entries), but certainly for tuples up 5 entries it holds true.

To test it yourself, you can use python's timeit module.

% python -m timeit -s "lst = range(50000)" "x = tuple(lst[10:15])"
1000000 loops, best of 3: 1.12 usec per loop
% python -m timeit -s "lst = range(50000)" "x = (lst[10], lst[11], lst[12], lst[13], lst[14])"
1000000 loops, best of 3: 0.862 usec per loop

That's almost a 30% improvement for a list with 5 entries.

Monday, February 5, 2007

Internationalized Emailing with Python

It turns out that creating internationalized emails (emails including characters not in US-ASCII) is quite a bit trickier than one would hope. There are a few relevant RFCs:

All dealing with how you encode the content, as well as the headers. Mostly because everything was assumed to run on only 7-bit compliant systems so you have to encode the heck out of everything. For the body of the email, you just add a "Content-Type: ... charset=" field, which explains what charset (codepage, encoding, etc) the content is in.

However, that doesn't work for headers, because that would require another header to define the encoding of this header. So instead they decided that "=?utf-8?b?ZsK1?=" was a good way to encode "".

This also wouldn't have been so bad, except they also decided that email addresses could not be escaped in this way, so you must use "=?utf-8?q?Joe_f=C2=B5?= " and not "=?utf-8?b?Sm9lIGbCtSA8am9lQGZvby5jb20+?=". (That is "Joe fµ " encoded as a single string).

So it turns out that the python standard library provides most of what you need, but you end up needing a bit of work to get it all together. So in the bzr-email plugin (which generates an email whenever you commit your changes) I did some work to make a nicer interface for creating and sending an email. Basically, it just assumes that you are going to use a Unicode string for the username + email address, splits them, and sets up all the right headers. It also handles connecting to a SMTP host, so you end up doing:

conn = SMTPConnection(config)
[u'Måry '],
u'Subject Hellø',
u'Body text\n')

This is especially important because Bazaar supports fully Unicode user names, commit messages, filenames, etc. (Which was tricky enough to get right because of all the complexities of Unicode.)

But I'm happy to say it all seems to be working. Now we just need to figure out how we would change the python standard library "email" library to make it easier for everyone else. (A further complication is that they changed the naming scheme between python2.4 and python2.5. It used to be "email.Utils" and it is now "email.utils". "email.MIMEText.MIMEText" became "email.mime.text.MIMEText".) Overall, I prefer the new layout, but it does mean you need a test import to work around it.

After doing the work, Marius Gedminas showed me where he had also run into the same situation:

Saturday, February 3, 2007

Converting 212,000 revisions in ~12hrs

We have been working with the Mozilla team, to see if they can use Bazaar for their development.

We have been having some difficulties scaling up to such a large project (approx 200,000 revisions, 55,000 files.)

However, we have been in the process of tweaking our conversion utilities, as well as the internals of Bazaar to make it faster. We also have been able to exploit one of the semi-controversial features of bzr (file-ids) to allow conversion of pieces of the project, and still have it integrate as a whole.

By breaking up the Mozilla tree into a bunch of sub-modules, we were able to improve our conversion speed to around 4-5 revisions/second. Which lets us convert 210,000 revisions in under 14 hours. Woot!! (a whole-tree conversion before we spent a lot of time was running at about 0.02 revs/s (50s/rev) and was taking almost 2 months to convert).

To bring these separate trees back into a single tree, I wrote a plugin which simplifies merging projects together (especially when you want to merge one into a subdirectory of the other).

Basically it provides a new command:
bzr merge-into ../other/project subdir
Which will merge the 'other/project' tree into the local tree rooted at 'subdir'. After doing this, merging back and forth between other/project and this one is greatly simplified, since bzr records all of the information it needs to know which files should be mapped to which ones.

It even works if you end up moving the files around in the new destination.

Bazaar Structure

I just want to link a great post by David about the Bazaar working model:

Some thing the model is a little complex. But really it just breaks down into 3 things.

A repository is where your historical information is stored.
A branch is a pointer into this historical information. As new work is done, its history is appended to an existing branch, which gives a view of how the work has evolved.
And a working tree is where all of the real work happens. (This is where you make changes, and actually develop *your* work). We take care of the rest, you deal with the working tree.

By having these as 3 separate concepts, we can build them up into several different (and all useful) arrangments.

New Blog

Well, hello everyone...

I'm going to try starting up this new blog area. And hopefully informing the world of progress we make in our distributed version control system:
I'll try to make informative posts about distributed version control (DVC) in general, and naturally I'll mention specifics of things that happen in our part of that world.

I'm loathe to start in the middle, but since we've already come so far, I'm not sure that I can start at the beginning. But here goes...