Wednesday, February 28, 2007

Dirstate providing large performance benefits in the workingtree

A while back (how many months now?) we designed a new disk layout for managing the working trees. Basically, instead of storing an XML inventory for the current tree, and then another XML inventory for the basis tree, we decided that we could store a custom file format which would keep the information of basis next to the information for current.

The idea was that rather than spending a long time parsing 2 XML files, and then iterating through one, and looking up the basis in the other, we could save a lot of time by parsing through them at the same time.

It allows us to compare entries without having to create full python objects (1.8us versus approx .14us each or somewhere between 10-30x faster).

Further XML is not particularly fast to parse, especially in python. cElementTree is a very fast XML parser, but it still has to handle a lot of edge cases and decoding of strings. While a custom format can have things ready for use.

We have a lot more tuning to go, but so far we have been able to improve the "bzr status" time by about 2x. On a large (55k entry) tree, 'bzr status' time dropped from 30s down to 15s on my machine. I think we can get it down to around 5s.

We are still going through the test suite to make sure we have full coverage of all edge cases, and we are ensuring that the new format passes all tests. This is a pretty big effort. Right now we have 4,838 tests in our test suite, and our dirstate branch is up to 5,333 tests. Sometimes having a large test suite is a pain (it takes a while to run, and it can be a little bit more difficult to refactor code). But it makes up for the fact that you know your code is correct when it does pass.

Oh, and the current branch is at:

for those who want to follow along at home.

No comments: