Monday, November 16, 2009

Memory Debugging with Meliae

Background of Meliae 0.1.0

Earlier this year I started working on a new memory debugging program for Python. I had originally tried to use heapy, but at the time it didn't support Windows, Mac, or 64-bit environments. (Which turned out to be all of my interesting platforms.) The other major problem is that I'm often debugging memory consumption of up to a GB of active data. While I think some of the former issues have been fixed, the latter is still a major issue for me.

So with the help of Michael Hudson, I started putting together a new structure. The code would be split into a scanner and a processor (loader), so that you could interrupt a running process, dump the memory consumption to disk, and then analyze it in a separate process (often after the former has stopped). The scanner can have a minimal memory profile, so even if your system is already swapping, you can dump out the memory info. (Robert Collins successfully dumped a 6GB memory profile, though analyzing that beast is still an issue.) The other advantage of this system is that I don't have to play tricks with objects that represent the current state, like Guppy does with all sorts of crazy decorators.

In recent months, I've also focused on improving Bazaar's memory profile, which also meant improving the memory profiling tools, enough that I felt it was worth releasing the code. So officially Meliae 0.1.0 has been released. (For those wondering about the name, it comes from the Ash-Wood Nymphs of Greek mythology; in other words, it is just a fun name.)

Doing real work
So how does one actually use the program? Bazaar has a very nice feature: you can use SIGQUIT (Ctrl+|) or SIGBREAK (Ctrl+Pause/Break) to drop into a debugger in the middle of a process to figure out what is going on. At that point, you can just:
from meliae import scanner
scanner.dump_all_objects('filename.json')
(There is an alternative scanner.dump_gc_objects() which has an even lower memory profile, but will dump some objects more than once, creating a larger dump file.)
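If you want the same convenience in your own long-running program, a minimal sketch of wiring it up looks like this. It is not part of Bazaar or Meliae itself, it assumes a POSIX platform (SIGQUIT does not exist on Windows), and the dump path is just an example:

import signal

def dump_memory(signum, frame):
    # Import lazily so the scanner is only loaded when a dump is requested.
    from meliae import scanner
    scanner.dump_all_objects('filename.json')

signal.signal(signal.SIGQUIT, dump_memory)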

dump_all_objects() creates a file describing all of the Python objects it was able to find, along with their known size, references, and for some objects (strings, ints) their content. From there, you start another shell and use:
>>> from meliae import loader
>>> om = loader.load('filename.json')
>>> s = om.summarize(); s

This dumps out something like:
Total 17916 objects, 96 types, Total size = 1.5MiB (1539583 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0     701   3    546460  35  35   49292 dict
     1    7138  39    414639  26  62    4858 str
     2     208   1     94016   6  68     452 type
     3    1371   7     93228   6  74      68 code
     4    1431   7     85860   5  80      60 function
     5    1448   8     59808   3  84     280 tuple
     6     552   3     40760   2  86     684 list
     7      56   0     29152   1  88     596 StgDict
     8    2167  12     26004   1  90      12 int
     9     619   3     24760   1  91      40 wrapper_descriptor
    10     570   3     20520   1  93      36 builtin_function_or_method
...

This shows the top object types and how much data they consume, which can often be revealing in itself. Do you have millions of tuples? One giant dict that is consuming a surprising amount of memory? (A dict with 200k entries is ~6MB on a 32-bit platform.)
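If you want to sanity-check a number like that outside of Meliae, a quick sketch using only the standard library works (assuming CPython; sys.getsizeof reports the dict's own table, not its keys and values, which is also what the 'dict' row above counts):

import sys

d = dict((i, str(i)) for i in range(200000))
print(sys.getsizeof(d))  # roughly the ~6MB figure above on a 32-bit build

The exact number depends on the Python version and pointer size, but it is a cheap way to build intuition for the summary table.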

There is more that can be done. You can run:
om.compute_referrers()

At this point, you can look at a single node, and find out what was referencing it. (So what was referencing that largest dict?)
>>> om[s.summaries[0].max_address]
MemObject(29351984, dict, 49292 bytes, 1578 refs [...], 1 referrers [26683840])

>>> om[26683840]
MemObject(29337264, function, format_string, 60 bytes, 6 refs...)
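You can also walk up the referrer chain programmatically. This is just a sketch based on the repr shown above; I'm assuming each MemObject exposes its referrers as a list of addresses once compute_referrers() has run:

obj = om[s.summaries[0].max_address]
for _ in range(5):                 # walk a few levels up from the big dict
    if not obj.referrers:
        break
    obj = om[obj.referrers[0]]     # follow the first referrer
    print(obj)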

However, it also turns out that all 'classic' classes in Python indirect to their data via self.__dict__, which is a bit annoying to walk through. It also makes it look like 'dict' is the #1 memory consumer, when actually it might be instances of Foo, which happen to use dicts. So you can use
om.collapse_instance_dicts()

This finds all instances that seem to have only a trivial reference to a __dict__, and collapses them so that all references come directly from the instance, and all referenced objects then claim the instance as the referrer.
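As a reminder of why instance data shows up as dicts in the first place, this is plain CPython behavior, nothing Meliae-specific:

class Foo:
    def __init__(self):
        self.a = 1
        self.b = 'some text'

f = Foo()
print(f.__dict__)  # {'a': 1, 'b': 'some text'} -- this dict holds the actual attribute data

Every such instance drags a dict along with it, which is what collapse_instance_dicts() folds back into the owning instance in the summary.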

After collapsing, the summary changes to:
>>> s = om.summarize(); s
Total 17701 objects, 96 types, Total size = 1.5MiB (1539583 bytes)
 Index   Count   %      Size   % Cum     Max Kind
     0    7138  40    414639  26  26    4858 str
     1     486   2    394632  25  52   49292 dict
     2     208   1     94016   6  58     452 type
     3    1371   7     93228   6  64      68 code
     4    1431   8     85860   5  70      60 function
     5     149   0     82844   5  75     556 ReadLineTextBuffer
     6      93   0     65384   4  79    6312 module
     7    1448   8     59808   3  83     280 tuple
     8     552   3     40760   2  86     684 list
     9      56   0     29152   1  88     596 StgDict
    10    2167  12     26004   1  90      12 int

This shows that ReadLineTextBuffer is actually a large consumer of memory.

Misc
There are other bits to explore, and improvements to be made. "scanner.get_recursive_size()" can be useful if you don't want to dump out a big file to analyze memory referenced from a given object (such as a cache). It doesn't give the whole picture, but can be useful in an interactive session.


Results
In the end, this code has enabled us to cut the memory consumption of Bazaar roughly in half (for bzr branch). It also lets you see nice summaries like this:

Total 2805995 objects, 276 types, Total size = 946.0MiB (991983819 bytes)
 Index    Count   %       Size   % Cum      Max Kind
     0  1939090  69  916011611  92  92  5762600 str
     1     9449   0   33069868   3  95  3145868 dict
     2   132202   4   12506732   1  96      536 unicode
     3   383436  13    7048652   0  97       20 bzrlib._static_tuple_c.StaticTuple
     4   160027   5    5873744   0  98      304 tuple
     5     5429   0    5185252   0  98   412236 list
     6    62256   2    4482432   0  99       72 InventoryFile
     7      148   0    1334032   0  99  1048692 set
     8     2185   0    1214860   0  99      556 GroupCompressBlock
     9     8003   0     992372   0  99      124 CHKInventoryDirectory
...


(Note that after seeing this, we changed the code to not cache as many strings in memory, and I managed to decrease memory consumption to about 1/3rd of what it once was for this operation.)

The code isn't perfect, but being able to get a view of where memory is going, and what objects are holding on to it, is a huge improvement over just being in the dark.

Thursday, October 15, 2009

The Joys of multiple releases

I had originally written a longer post over at wordpress, only to have Firefox crash while trying to move an image, and WP doesn't do auto-saving like blogger. So now I'm back...

Bazaar 2.0.1 and 2.1.0b1 have now 'gone gold' in that I've uploaded the official tarballs, and asked people to make installers for them. Once installers are made, then we'll make the official announcement.

For those who haven't been following, Bazaar has now split its releases into 2 series. The 2.0.x series is based on 2.0.0 and has only bugfixes. Things that could cause compatibility problems (new features, removal of deprecated code, etc.) are only done in the 2.1.x series. We're hoping that this can give people some flexibility, as well as giving us more flexibility. In the past, we've suffered a bit trying to maintain backwards compatibility for some features/bugfixes, only to break compatibility for a big feature. Instead of suffering the worst of both, we're trying to get the best of both. If something needs to break compatibility, it just goes in the dev branch. Note that the development branch is still considered 'stable', in that the test suite always passes, and the code is pretty much always ready for a release. We just don't make the same guarantees about stable internal APIs for 3rd parties to use.

The other change to the process is to stop doing as many "release candidate" builds. Instead, we will just cut a release; if there are problems, we'll cut the next release sooner. The chance of regressions in the 'bugfix-only' 2.0.x series should be low, and getting away from pre-builds means less overhead. We will still do releases we call 'rc1' before the next major stable release (2.1.0), and in that vein we expect to make little to no changes between rc1 and the final build.

However, this new system does increase the overhead of a single release, as it is now equivalent to doing the rc and the final on the same day. Also, because we now have 2 "integration" branches, it requires a bit more coordination between them.

For example, this is the revision graph for the recent 2.0.1 and 2.1.0b1 releases:

The basic workflow that I used was something like
  1. Have a LOSA create 2 release branches lp:~bzr-pqm/bzr/2.0.1 and lp:~bzr-pqm/bzr/2.1.0b1
  2. Create a local branch of each
  3. Create another branch for doing my updates in, such as lp:~jameinel/bzr/2.0.1
  4. Update 2.0.1 with a new version string
  5. Update NEWS to clean it up, show that there is an official release, and provide a summary/overview of the changes.
  6. Land this update into the official 2.0.1 branch via PQM. (Unfortunately this can take up to 2 hours depending on a bunch of different factors. We are trying to get this down to more like 10 min.)
  7. Update my local copy from the final release. Tag it (bzr-2.0.1).
  8. Create the tarball
  9. Create the release on Launchpad
  10. Upload the tarball to the release
  11. While this is going on, go through the bugtracker and make sure that things mentioned in NEWS have the appropriate "Fix Released" state in the bug tracker, as well as being associated with the right milestones. With 34 bugfixes, this is a non-trivial undertaking.
  12. Merge the 2.0.1 final release into the 2.1.0b1 branch. (All bugfixes in the stable series are candidates for merging at any time into the development series.)
  13. Do lots of cleanup in NEWS. The main difficulty here is that bugfixes are present on 2 integration branches simultaneously, and those releases are slightly independent. We've talked about having each bugfix mentioned in both sections, which would be more important if we ever make a development release without doing the corresponding stable release.
  14. Do steps 4-10 again for 2.1.0b1.
  15. While working or waiting on that, prepare lp:~bzr-pqm/bzr/2.0 since it is now going to be prepped for 2.0.2. This involves bumping the version number, updating NEWS with blank entries for the next release (avoids some conflicts for people landing changes in that branch), and submitting all of that back to PQM.
  16. When that has finished, bring the 2.0 stable branch back into bzr.dev, and prepare bzr.dev for 2.1.0b2 (version number bumps, NEWS cleanups, etc.).
  17. In this case, cleaning up NEWS was again a bit of a chore, as now you have a file that should have a blank area for both the 2.1.0b2 changes and the 2.0.2 changes. Further, some of the changes that landed in bzr.dev in the meantime were not included in the 2.1.0b1 release, so you have to move them up into the new section. Getting NEWS right across 4 branches was quite a bit of work, and probably the hardest part (so far) of the process. Copy & paste + bzr diff + bzr vimdiff were quite helpful here. Setting the NEWS in bzr.dev to an exact copy of the one from 'bzr-2.1.0b1' and then showing what was removed/added was a nice way to make sure I got everything.
  18. breathe
  19. Announce the tarballs, etc on the bzr mailing list, so that people can start preparing packages/installers.
  20. I'm also the windows installer packager, so I get to build around 8 installers... (standalone installers, and 3 python [2.4, 2.5, 2.6] installers.) Most of this is scripted, but often something breaks along the way.
  21. Update PyPI, Freshmeat, ... with the new versions. Twice. (At least here we did not update for 'rc' versions, so this work is strictly doubled.)
Overall, it is a fair amount of work. I think it will amount to 1-2 days of work time (spread out over 3+ days of real time). With any luck, it will end up being 'more concentrated, but less often'. I would say that we'd get more practice, but we also try to rotate release managers, both to spread the knowledge and to avoid burnout. (Though Martin has been doing the last 4-or-so releases...)

So here's to everyone upgrading to their preferred release (in about a week's time).

Thursday, October 8, 2009

Refactoring work for review (and keep your annotations)

Tim Penhey recently had a nice post about how he split up his changes to make it easier to review. His method used 'bzr pipeline' and some combinations of shelving, merging, and reverting the merges.

However, while I wanted to refactor my changes to make it easier to review, I didn't want to lose my annotation history. So I took a different approach.

To start with, I'll assume you have a single branch with lots of changes, each related to a different feature. You developed them 'concurrently' (jumping back and forth between features, without actually committing them to different branches), and now that you are done, you want to split them out again.

There are a lot of possible ways you can do this. Some proponents prefer a 'rebase' style, where you replay the commits you made in a new order, possibly squashing them, etc. I'm personally not a big fan of that.
Tim's is another method, where you just cherry-pick the changes into new branches and use something like bzr-pipeline to manage the layering. However, reading his workflow, he would also lose the history of the individual changes.

So this is my workflow.
  1. Start with a branch that has a whole lot of changes on it, and is essentially 'done'. We'll call this branch "dogpile".
  2. Create a new branch from it (bzr branch --switch ../dogpile ../feature1), and remove all of the changes but the 'first step'. I personally did that with "bzr revert -r submit: file1 file2 file3" but left "file4" alone.
  3. "bzr commit" in that branch. The delta for that revision will show a lot of your hard-worked on changes being removed. However "bzr diff -r submit:" should show a very nice clean patch that only includes the changes for "feature1".
  4. Go back to the original dogpile branch, and create a new "feature2" branch. (bzr branch --switch ../dogpile ../feature2)
  5. Now merge the "feature1" branch (bzr merge ../feature1). At this point, it looks like everything has been removed except for the bits for feature1. However, just using "bzr revert file2..." we can restore the changes for "feature2".
  6. You can track your progress in a few ways. "bzr diff -r submit:" will show you the combined differences from feature1 and feature2. "bzr diff -r -1:../feature1" will show you just the differences between the current feature2 branch and the feature1 branch. The latter is what you want to be cleaning up, so that it includes all of your feature2 changes, built on top of your feature1 changes. You also have the opportunity to tweak the code a bit, and run the test suite to make sure things are working correctly.
  7. "bzr commit" in that branch. At this point, the diff from upstream to feature1 should be clean, and the diff from feature1 => feature2 should be clean. As an added benefit, doing "bzr annotate file2" will preserve all the hard-won history of the file.
  8. Repeat steps 4-7 for all the other features you wanted to split out into their own branches.
When you are done, you will have N feature branches, split up from the original "dogpile" branch. By using the "merge + revert things back into existence" trick, you can preserve all of the annotations for your files. This works because you have 2 sources that the file content could come from. One source is the "dogpile" branch, and the other source is a branch where "dogpile" changes were removed. Since the changes are present in one of the parents, the annotations are brought from there.

This is what the 'qlog' of my refactoring looks like.

The actual content changes (the little grey dots) span about 83 commits. However, you can see that I split that up into 6 new branches (some more independent than others), all of which generate a neat difference to their parent, and preserve all of the annotation information from the full history. You can also see that now that I have it split out, I can make simple changes to each branch (notice that purple has an extra commit). This will most likely come into play if people ask for any changes during review.

Monday, March 23, 2009

brisbane-core

Jonathan Lange decided to drop some hints about what is going on in Bazaar, and I figured I could give a bit more detail. "Brisbane-core" is the code name for our next-generation repository format, so named because we started working on it at our November sprint in Brisbane last year.

I'd like to start by saying we are really excited about how things are shaping up. We've been doing focused work on it for at least 6 months now. Some of the details are up on our wiki for those who want to see how it is progressing.

To give the "big picture" overview, there are 2 primary changes in the new repository layout.
  1. Changing how the inventory is serialized. (makes log -v 20x faster)
  2. Changing how data is compressed. (means the repository becomes 2.5:1 smaller, bzr.dev now fits in 25MB down from 100MB, MySQL fits in 170MB down from 500MB)
The effect of these changes is both much less disk space used (which also reduces the number of bytes transmitted for network operations) and faster delta operations (things like 'log -v' are now O(logN) rather than O(N), or 20x faster on medium-sized trees, and probably much faster on large trees).


Inventory Serialization


The inventory is our meta-information about what files are versioned and what state each file is at (git calls this the 'tree', mercurial the 'manifest'). Before brisbane-core, we treated the inventory as one large (xml) document, and we used the same delta algorithm as for user files to shrink it when writing it to the repository. This works ok, but for large repositories, it is effectively a 2-4MB file that changes on every commit. The delta size is small, but the uncompressed size is very large. So to make it store efficiently, you need to store a lot of deltas rather than fulltexts, which causes your delta chain to grow and makes extracting a given inventory slower. (Under certain pathological conditions, the inventory can actually take up more than 50% of the storage in the repository.)

Just as important as disk consumption is comparison: to compare two inventories, we would have to deserialize two large documents into objects and then compare all of the objects to see what has and has not changed. You can do this in sorted order, so it is O(N) rather than O(N^2) for a general diff, but it still means looking at every item in a tree, so even small changes take a while to compute. Also, just getting a little bit of data out of the tree meant reading a large file.

So with brisbane-core, we changed the inventory layer a bit. We now store it as a radix tree, mapping between file-id and the actual value for the entry. There were a few possible designs, but we went with this one because we knew we could keep the tree well balanced, even if users do strange things with how they version files. (git, for example, uses directory-based splitting. However, if you have many files in one dir, changing one record rewrites the entries for all of its neighbors, and if you have a very deep directory structure, changing something deep has to rewrite all pages up to the root.) This has a few implications.

1) When writing a new inventory, most of the "pages" get to be shared with other inventories that are similar. So while conceptually all information for a given revision is still 4MB, we now share 3.9MB with other revisions. (Conceptually, the total uncompressed data size is now closer to proportional to the total changes, rather than tree size * num revisions.)

2) When comparing two inventories, you can now safely ignore all of those pages that you know are identical. So for two similar revisions, you can find the logical difference between them by looking at data proportional to the difference, rather than the total size of both trees.
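To make points (1) and (2) concrete, here is a toy, single-level version of the idea in plain Python. It is not bzr's actual implementation (the real code uses a multi-level CHK map), but it shows why bucketing entries by a hash of the file-id keeps the pages balanced and lets similar revisions share almost all of their pages:

import hashlib

def pages_for(inventory):
    # Group (file_id -> entry) items into pages keyed by a hash prefix of the file-id.
    buckets = {}
    for file_id, entry in sorted(inventory.items()):
        prefix = hashlib.sha1(file_id.encode('utf8')).hexdigest()[:2]
        buckets.setdefault(prefix, []).append((file_id, entry))
    # Store each page under the hash of its content, like a content-addressed store.
    return dict((hashlib.sha1(repr(page).encode('utf8')).hexdigest(), page)
                for page in buckets.values())

rev1 = dict(('file-id-%04d' % i, 'entry for file %d' % i) for i in range(1000))
rev2 = dict(rev1)
rev2['file-id-0042'] = 'entry for file 42, modified'

pages1, pages2 = pages_for(rev1), pages_for(rev2)
print(len(pages1), len(set(pages1) & set(pages2)))  # all but one page is shared

Comparing rev1 and rev2 then only means looking inside the pages whose hashes differ, which is where the "proportional to the difference" behavior comes from.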


Data Compression

At the same time that we were updating the inventory logic, we also wanted to improve our storage efficiency. Right now, we store a line-based delta to the previous text. This works ok, but there are several places where it is inefficient.
  1. To get the most recent text, you have to apply all of the deltas so far. Arguably the most recent text is accessed more often than old texts, yet it is the slowest to get. To offset this, you can cap the maximum number of deltas before you insert a fulltext, but that also affects your storage efficiency.
  2. Merges are a common issue, as one side of the merge will have N deltas representing the changes made on that side; when you then merge, you end up with yet another copy of those texts. Imagine two branches, each changing 10 lines. When you merge them, a delta against either parent can reuse that parent's 10 lines, but the other 10 lines look like a new insert. Thought of another way, after a merge you have many lines that have existed in other revisions, but never in the same combination, so comparing against any single text would always be inefficient.
  3. Cross-file compression. Similar to the 'single parent' issue in (2), there are also times when you have texts that don't share a common ancestry but actually have a lot of lines in common (like all of the copyright headers).
There are lots of potential solutions for these, but the one we went with is what we call "groupcompress". The basic idea is that you build up a "group" of texts that you compress together. We start by inserting a fulltext for the most recent version of a file. We then start adding ancestors of the file to the group, generating a delta for the changes. The lines of the delta are then used as part of the source when computing the next delta. Once enough texts have been added to a group, we pass the whole thing through another compressor (currently zlib, though we are evaluating lzma as a "slower but smaller" alternative).

As an example, say you have 3 texts.
text1:
first line
second line
third line

text2:
first line
modified second line
third line

text3:
first line
remodified second line
third line

So if you insert text3 at the start as the fulltext, then when you insert text2 you end up with a delta that inserts "modified second line" into the stream (the other lines are copies from text3). When you get to text1, you can then copy the bytes for "first line" and "third line" from text3, and the bytes for "second line" out of the "modified second line" that text2's delta inserted.

There are a few ways to look at this. For example, one can consider that the recipe for extracting text1 is approximately the same as if you used a simple delta for text3 => text2, and then another delta for text2 => text1. The primary difference is that the recipe has already combined the two deltas together. The main benefit is that to extract text1, you don't have to create the intermediate text2.

One downside to storing the expanded recipes is that there is some redundancy. Consider that both text2 and text1 will be copying "first line" from text3. In short examples, this isn't a big deal, but if you have 100s of texts in a row, the final recipe will look very similar to the previous one, and they will be copy instructions from a lot of different regions. (Development tends to add lines, so storing things in reverse order means those lines look like deletions. Removing lines splits a copy command into 2 copy commands for the lines before and the lines after.)

A lot of that redundancy is removed by the zlib pass. By doing the delta compression first, you can still get good efficiency from zlib's 32kB window. The other thing we do is analyze the complexity of the recipe. If the recipe starts becoming too involved, we will go ahead and insert a new fulltext, which then becomes a source for all the other texts that follow. There are lots of bits that we can tune here. The most important part right now is making sure that the storage is flexible to allow us to change the compressor in the future, without breaking old clients.
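To make the scheme concrete, here is a toy sketch in plain Python. It is not the real groupcompress implementation (which works on raw bytes with its own delta format and heuristics); it just shows the shape described above: each new text becomes copy/insert instructions against the group's source, only the inserted bytes join that source, and the whole group gets a final zlib pass.

import difflib
import zlib

class ToyGroup:

    def __init__(self):
        self.source = ''    # fulltext of the first text, plus the inserted bytes of later deltas
        self.recipes = []   # one list of copy/insert instructions per text

    def add(self, text):
        recipe, new_bytes = [], []
        matcher = difflib.SequenceMatcher(None, self.source, text)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == 'equal':
                recipe.append(('copy', i1, i2 - i1))       # copy a range out of the group source
            elif tag in ('insert', 'replace'):
                literal = text[j1:j2]
                recipe.append(('insert', literal))
                new_bytes.append(literal)
        self.recipes.append(recipe)
        self.source += ''.join(new_bytes)   # the new bytes become source for the next delta
        return recipe

    def extract(self, index):
        # Rebuild a text from its recipe; no intermediate texts are ever constructed.
        parts = []
        for instruction in self.recipes[index]:
            if instruction[0] == 'copy':
                _, start, length = instruction
                parts.append(self.source[start:start + length])
            else:
                parts.append(instruction[1])
        return ''.join(parts)

    def compressed(self):
        # The final zlib pass squeezes out the redundancy between similar recipes.
        return zlib.compress((self.source + repr(self.recipes)).encode('utf8'))

group = ToyGroup()
text3 = 'first line\nremodified second line\nthird line\n'
text2 = 'first line\nmodified second line\nthird line\n'
text1 = 'first line\nsecond line\nthird line\n'
for text in (text3, text2, text1):   # newest first, as described above
    group.add(text)
assert group.extract(2) == text1

The real format also tracks how convoluted a recipe has become and falls back to inserting a new fulltext when needed, as described above; this sketch skips that entirely.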


What now?

The branch where the work is being integrated is available from:
bzr branch lp:~bzr/bzr/brisbane-core
cd brisbane-core
make

For now, the new repository format is available as "bzr init-repo --format=gc-chk255-big", but it is considered "alpha". This means it passes tests, but we reserve the right to change the disk format at will. Our goal is to get the format to "beta" within a month or so, at which point it will land in bzr.dev (and thus the next release) as a "--development" format (also available in the nightly ppa). At that point, we won't guarantee that the format will be supported a year from now, but we do guarantee to support converting data out of that format. (So if we release --development5, there will be an upgrade path to --development6 if we need to bump the disk format.)

Going further, we are expecting to make it our default format by around June, 2009.

At this point changes are mostly polish and ensuring that all standard use cases do not regress in performance. (Operations that are O(N) are often slower in a split layout, because you spend time bringing in each page. But if you can change them into O(logN) the higher constants don't matter anymore.)