Friday, March 2, 2007

Dirstate: another 2x performance boost

Another round of performance optimizations in the dirstate code brings us down from 15s down to 8s to do a complete 'bzr status' in a tree with 50,000 files and 5,000 directories. (bzr 0.14 takes approx 30s on the same tree).

There were a few tricks and a few cleanups.
1) Make one pass over the filesystem, rather than 2. We were making a second pass to check for unknown files rather than determining that in the first pass. It doesn't take as long as the first pass, since things are usually cached, but it is work that doesn't need to be done.

2) Work in raw filesystem paths when possible.

Internally in bzr we try to work in Unicode strings as much as possible. It makes things consistent across platforms (I can check in a file called جوجو.txt on windows, and have it show up with the correct filename on Mac and Linux). In fact, on Windows you need to use the Unicode api if you want to get the correct filenames. (They have an OEM name and a Unicode name, but if the characters are not in your codepage you get ????.txt in OEM mode).

However on Linux, if you want to use Unicode filenames, it has to decode every name that it finds (the difference between os.listdir('.') and os.listdir(u'.')).

With the dirstate refactoring, we now have a layer that can work in utf8 paths to find changes before it goes up to the next layer which can deal in Unicode paths for simplicity.

3) There is still more we can do. We are trying to continue doing this as a series of correctness preserving steps. But I am happy to say that we are getting some very good results after the last few months of effort. I honestly didn't think that the performance benefits would be this great this early.

No comments: