John Arbash Meinel's Bazaar Blog

Wednesday, August 4, 2010

Step-by-step Meliae

Some people asked me to provide a step-by-step guide to how to debug memory using Meliae. I just ran into another unknown situation, so I figured I'd post a step-by-step along with rationale for why I'm doing it.

First is loading up the data. This was a dump while running 'bzr pack' of a large repository.
```
>>> from meliae import loader
>>> om = loader.load('big.dump')
>>> om.remove_expensive_references()
```
The last step is done because otherwise instances keep a reference to their class, and classes reference their base types, and you end up getting to 'object', and somewhere along the way you end up referencing too much. I don't do it automatically, because it does remove actual references, which someone might want to keep.

Then, do a big summary, just to get started

>>> om.summarize()
Total 8364538 objects, 286 types, Total size = 440.4MiB (461765737 bytes)
Index   Count   %      Size   % Cum     Max Kind
0 2193778  26 181553569  39  39 4194281 str
1   12519   0  97231956  21  6012583052 dict
2 1599439  19  68293428  14  75     304 tuple
3 3459765  41  62169616  13  88      20 bzrlib._static_tuple_c.StaticTuple
4      82   0  29372712   6  94 8388724 set
5 1052573  12  12630876   2  97      12 int
6    1644   0   4693700   1  98 2351848 list
7    4038   0   2245128   0  99     556 _LazyGroupCompressFactory

You can see that
1. There are 8M objects, and about 440MB of reachable memory.
2. The vast bulk of that is in strings, but there are also some oddities, like that 12.5MB dictionary
At this point, I wanted to understand what was up with that big dictionary.
```
>>> dicts = om.get_all('dict')
>>> dicts[0]
dict(417338688 12583052B 1045240refs 2par)
```
om.get_all() gives you a list of all objects matching the given type string. It also sorts the returned list, so that the biggest items are at the
beginning.

Now lets look around a bit, to try to figure out where this dict lives

>>> bigd = dicts[0]
>>> from pprint import pprint as pp
We'll use pprint a lot, so map it to something easy to type.
>>> pp(bigd.p)
[frame(39600120 464B 23refs 1par '_get_remaining_record_stream'),
_BatchingBlockFetcher(180042960 556B 17refs 3par)]

So this dict is contained in a frame, but also an attribute of _BatchingBlockFetcher. Let's try to see which attribute it is.

>>> pp(bigd.p[1].refs_as_dict())
{'batch_memos': dict(584888016 140B 4refs 1par),
'gcvf': GroupCompressVersionedFiles(571002736 556B 13refs 9par),
'keys': list(186984208 16968B 4038refs 2par),
'last_read_memo': tuple(536280880 40B 3refs 1par),
'locations': dict(417338688 12583052B 1045240refs 2par),
'manager': _LazyGroupContentManager(584077552 172B 7refs 3716par),
'memos_to_get': list(186983248 52B 1refs 2par),
'total_bytes': 774119}

It takes a bit to look through that, but you can see:
```
'locations': dict(417338688 12583052B 1045240refs 2par)
```
Note that 1045240refs means there are 522k key:value pairs in this dict.

How much total memory is this dict referencing?

>>> om.summarize(bigd)
Total 4035636 objects, 22 types, Total size = 136.8MiB (143461221 bytes)
Index   Count   %      Size   % Cum     Max Kind
0 1567864  38  66895512  46  46      52 tuple
1  285704   7  24972909  17  64     226 str
2 1142424  28  20757800  14  78      20 bzrlib._static_tuple_c.StaticTuple
...
8       2   0      1832   0  99    1684 FIFOCache
9      35   0      1120   0  99      32 _InternalNode

So about 136MB out of 440MB is reachable from this dict. However, I'm noticing that FIFOCache and _InternalNode is also reachable, and those don't really seem to fit. I also notice that there are 1.6M tuples here, which is often a no-no. (If we are going to have that many tuples, we probably want them to be StaticTuple() because they use a fair amount less memory, can be interned, and aren't in the garbage collector. So lets poke around a little bit
```
>>> bigd[0]
bzrlib._static_tuple_c.StaticTuple(408433296 20B 2refs 9par)
>>> bigd[1]
tuple(618390272 44B 4refs 1par)
>>> pp(bigd[0].c)
[str(40127328 80B 473par 'svn-v4:138bc75d-0d04-0410-961f-82ee72b054a4:trunk:126948'),
str(247098672 85B 37par '14@138bc75d-0d04-0410-961f-82ee72b054a4:trunk%2Fgcc%2Finput.h')]
>>> pp(bigd[1].c)
[tuple(618383880 36B 2refs 1par),
bzrlib._static_tuple_c.StaticTuple(569848240 16B 1refs 3par),
NoneType(505223636 8B 1074389par),
tuple(618390416 48B 5refs 1par)]
```
One thing to note, dict references are [key1, value1, key2, value2] while tuple references are (last, middle, first). I don't know why tuple.tp_traverse traverses in reverse order, but it does. And StaticTuple followed its lead.
The things to take away from this is
1. It is mapping a StaticTuple(file_id, revision_id) => tuple()
2. The target tuple is actually quite complex, so we'll have to dig a bit deeper to figure it out.
3. The file-id and revision-id are both referenced many times (37 and 473 respectively), so we seem to be doing a decent job sharing those strings.
At this point, I would probably pull up the source code for _BatchingBlockFetcher, to try and figure out what is so big for locations. Looking at the source code, it is actually built in _get_remaining_record_stream as:
```
locations = self._index.get_build_details(keys)
```
This is then defined as returning:
```
  :return: A dict of key: (index_memo, compression_parent, parents, record_details).
```

And the index memo contains a reference to the indexes themselves, but they don't really 'own' them. So lets filter them out:

>>> indexes = om.get_all('BTreeGraphIndex')
>>> om.summarize(bigd, excluding=[o.address for o in indexes])
Total 3740667 objects, 6 types, Total size = 122.9MiB (128855911 bytes)
Index   Count   %      Size   % Cum     Max Kind
  0 1567860  41  66895360  51  51      48 tuple
  1  189162   5  19690647  15  67     226 str
  2  948160  25  17261048  13  80      20 bzrlib._static_tuple_c.StaticTuple
  3       1   0  12583052   9  9012583052 dict
  4 1035483  27  12425796   9  99      12 int
  5       1   0         8   0 100       8 NoneType

(It is currently a bit clumsy that you have to do [o.address], but it means you can use large sets of ints. I'm still trying to sort that out.)
The memory consumption here looks more realistic. You can also see that just the tuple objects by themselves consume 67MB, or 51% of the memory. You can also see that for a dict holding 500k entries, we have 1.5M tuples. So we are using 3 tuples per key.

Note that we can't just use StaticTuple here, because index_memo[0] is the BTreeGraphIndex. Digging into the code, I think the data is all here:
```
 result[key] = (self._node_to_position(entry),
               None, parents, (method, None))
```
You can see that there is a whole lot of 'None' in this, and we also have an extra tuple at the end which is a bit of a waste (vs just inlining the content). We could save 28 bytes/record (or 28*500k = 14MB) by just inlining that last (method, None). Though it changes some apis.
Another thing to notice is that if you grep through the source code for uses of 'locations', you can see that we use the parents info and the index_memo, but we just ignore everything else. (method, compression_parent, and eol info are never interesting here). So really the result could be:
result[key] = (self._node_to_position(entry), parents)
This would be 28 + 4*2 = 36 vs (28+4*4 + 28+4*2) = 80, or saving 44b/record*.5M = 22MB. That is about 20% of that 122MB. Which isn't huge, but isn't a lot of effort to get. We could get a little better if we could collapse the node_to_position info along side the parents info, etc. (Say with a custom object.) That could shave another 28 bytes for the tuple, and maybe one extra reference.

I ended up working on this, because it was like a 10 minute thing. I ended up creating this class (code at lp:

class _GCBuildDetails(object):
   """A blob of data about the build details.

   This stores the minimal data, which then allows compatibility with the old
   api, without taking as much memory.
   """

   __slots__ = ('_index', '_group_start', '_group_end', '_basis_end',
                '_delta_end', '_parents')

   method = 'group'
   compression_parent = None

   def __init__(self, parents, position_info):
       self._parents = parents
       self._index = position_info[0]
       self._group_start = position_info[1]
       # Is this _end or length? Doesn't really matter to us
       self._group_end = position_info[2]
       self._basis_end = position_info[3]
       self._delta_end = position_info[4]

   def __repr__(self):
       return '%s(%s, %s)' % (self.__class__.__name__,
           self.index_memo, self._parents)

   @property
   def index_memo(self):
       return (self._index, self._group_start, self._group_end,
               self._basis_end, self._delta_end)

   @property
   def record_details(self):
       return static_tuple.StaticTuple(self.method, None)

   def __getitem__(self, offset):
       """Compatibility thunk to act like a tuple."""
       if offset == 0:
           return self.index_memo
       elif offset == 1:
           return self.compression_parent # Always None
       elif offset == 2:
           return self._parents
       elif offset == 3:
           return self.record_details
       else:
           raise IndexError('offset out of range')
         
   def __len__(self):
       return 4

The size of this class is 48 bytes, including the python object and gc overhead. This replaces the tuple(index_memo_tuple(index, start, end, start, end), None, parents, tuple(method, None)). Which is 28+4*4 + 28+4*5 + 28+4*2 = 128 bytes. So we save 80 bytes per record. on my bzr.dev repository that is ~10.6MB, on this dump it would be 40MB.

The other bit to look at is measuring real-world results. Which looks
something like this:

>>> from bzrlib import branch, trace, initialize; initialize().__enter__()

>>> b = branch.Branch.open('.')
>>> b.lock_read()
LogicalLockResult(/2.3-gc-build-details/)>)
>>> keys = b.repository.texts.keys()
>>> trace.debug_memory('holding all keys')
WorkingSize   33192KiB  PeakWorking   34772KiB  holding all keys
>>> locations = b.repository.texts._index.get_build_details(keys)
>>> trace.debug_memory('holding all keys')
WorkingSize   77604KiB  PeakWorking   87960KiB  holding all keys
>>>

Hopefully this has been informative. Digging into a bit of memory consumption, and how to determine where memory is being consumed, and a bit of understanding about how you can rework python objects to save a bit of memory (the biggest thing is to try to use fewer objects overall, since every object is at least 24 bytes, and that is if you are using __slots__. If you aren't then it is a minimum of 172 bytes (32 for the base object + 140 for its __dict__).

Monday, August 2, 2010

Meliae 0.3.0, statistics on subsets

Ah, yet another release. Hopefully with genuinely useful functionality.

In the process of inspecting yet another unexpected memory consumption, I came across a potential solution to the reference cycles problem.

Specifically, the issue is that often (at least in our codebases) you have coupled classes, that end up in a cycle, and you have trouble determining who "owns" what memory. In our case, the objects tend to be only 'loosely' coupled. In that one class passes off reference to a bound method to another object. However, a bound method holds a reference to the original object, so you get a cycle. (For example Repository passes its 'is_locked()' function down to the VersionedFiles so that they know whether it is safe to cache information. Repository "owns" the VersionedFiles, but they end up holding a reference back.)

What turned out to be useful was just adding an exclusion list to most operations. This ends up letting you find out about stuff that is referenced by object1, but is not referenced inside a specific subset.

One of the more interesting apis is the existing ObjManager.summarize().

So you can now do stuff like:

>>> om = loader.load('my.dump')
>>> om.summarize()
>>> om.summarize()
Total 5078730 objects, 290 types, Total size = 367.4MiB (385233882 bytes)
Index   Count   %      Size   % Cum     Max Kind
    0 2375950  46 224148214  58  58 4194313 str
    1   63209   1  77855404  20  78 3145868 dict
    2 1647097  32  29645488   7  86      20 bzrlib._static_tuple_c.StaticTuple
    3  374259   7  14852532   3  89     304 tuple
    4  138464   2  12387988   3  93     536 unicode
 ...

You can see that there is a lot of strings and dicts referenced here, but who owns them. Tracking into the references and using om.compute_total_size() just seems to get a lot of objects that reference everything. For example:

>>> dirstate = om.get_all('DirState')[0]
>>> om.summarize(dirstate)
Total 5025919 objects, 242 types, Total size = 362.0MiB (379541089 bytes)
Index   Count   %      Size   % Cum     Max Kind
    0 2355265  46 223321197  58  58 4194313 str
...

Now that did filter out a couple of objects, but when you track the graph, it turns out that DirState refers back to its WorkingTree, and WT has a Branch, which has the Repository, which has all the actual content. So what is actually referred to by just DirState?

>>> from pprint import pprint as pp
>>> pp(dirstate.refs_as_dict())
{'_bisect_page_size': 4096,
...
'_sha1_file': instancemethod(34050336 40B 3refs 1par),
'_sha1_provider': ContentFilterAwareSHA1Provider(41157008 172B 3refs 2par),
...
'crc_expected': -1471338016}
>>> pp(om[41157008].c)
[str(30677664 28B 265par 'tree'),
WorkingTree6(41157168 556B 35refs 7par),
type(39222976 452B 4refs 4par 'ContentFilterAwareSHA1Provider')]
>>> wt = om[41157168]
>>> om.summarize(dirstate, excluding=[wt.address])
Total 5025896 objects, 238 types, Total size = 362.0MiB (379539040 bytes)

Oops, I forgot an important step. Instances refer back to their type, and new-style classes keep an MRU reference all the way back to object which ends up referring to the whole dataset.

>>> om.remove_expensive_references()
removed 1906 expensive refs from 5078730 objs

Note that it doesn't take many references (just 2k out of 5M objects) to cause these problems.

>>> om.summarize(dirstate, excluding=[wt.address])
Total 699709 objects, 19 types, Total size = 42.2MiB (44239684 bytes)
Index   Count   %      Size   % Cum     Max Kind
    0  285690  40  20997620  47  47     226 str
    1  212977  30   8781420  19  67      48 tuple
    2   69640   9   8078240  18  85     116 set
...

And there you see that we have only 42MB that is directly referenced from DirState. (still more than I would like, but at least it is useful data, rather than just saying it references all objects).

I'm not 100% satisfied with the interface. Right now it takes an iterable of integer addresses. Which is often good because those integers are small and shared, so the only cost is the actual list. Taking objects requires creating the python proxy objects, which is something I'm avoiding because it actually requires a lot of memory to do so. (Analyzing 10M objects takes 1.1GB of peak ram, 780MB sustained.)

Tuesday, July 20, 2010

Meliae 0.2.1

Meliae 0.2.1 is now officially released.

The list of changes isn't great, it is mostly a bugfix release. There are a couple of quality-of-life changes.

For example you used to need to do:

>>> om = loader.load(filename)
>>> om.compute_parents()
>>> om.collapse_instance_dicts()

However, that is now done as part of loader.load(). This also goes along with a small bug fix to scanner.dump_all_objects() that makes sure to avoid dumping the gc.get_objects() list, since that is artifact from scanning, and not actually something you care about in the dump file.

Many thanks to Canonical for bringing me to Prague for the Launchpad Epic, giving me some time to work on stuff that isn't just Bazaar.

Saturday, January 9, 2010

Meliae 0.2.0

And here we are, with a new release of Meliae 0.2.0. This is a fairly major reworking of the internals, though it should be mostly compatible with 0.1.2. (The disk format did not change, most of the apis have deprecated thunks to help you migrate.)

The main difference is how data is stored in memory. Instead of using a Python dict + python objects, I know use a custom data collection. Python's generic objects are great for getting stuff going, but I was able to cut memory consumption in half with a custom object. This means that finally, analyzing a 600MB dump takes less than 600MB of memory (currently takes about ~300MB). Of course that also depends on your data structures (600MB dump that is one 500MB string will take up very little memory for analysis.)

The second biggest feature is hopefully a cleaner interface.

Call references 'parents' or 'children'. Indicating objects which point to me, and objects which I point to, respectively. 'ref_list' and 'referrers' was confusing. Both start with 'ref', so it takes a bit to sort them out.
Add attributes to get direct access to parents and children, rather than having to go back through the ObjManager.
Change the formatting strings to be more compact. No longer show the refs by default, since you can get to the objects anyway.

A third minor improvement is support for collapsing old-style classes (ones that don't inherit from 'object'.)

So how about an example. To start with, you need a way to interrupt your running process and get a dump of memory. I can't really give you much help, but you'll end up wanting:

from meliae import scanner
scanner.dump_all_objects('test-file.dump')

(This is the simplest method. There are others that take less memory while dumping, if overhead is a concern.)

Once you have that dump file, start up another python process and let's analyze it.

$ python
>>> from meliae import loader
>>> om = loader.load('test-file.dump')
loaded line 3579013, 3579014 objs, 377.4 / 377.4 MiB read in 79.6s

I recommend just always running these lines. If you used a different method of dumping, there are other things to do, which is why it isn't automatic (yet).

>>> om.compute_parents(); om.collapse_instance_dicts()
set parents  3579013 /  3579014
checked  3579013 /  3579014 collapsed   383480
set parents  3195533 /  3195534

Now we can look at the data, and get a feel for where our memory has gone:

>>> s  = om.summarize(); s
Total 3195534 objects, 418 types, Total size = 496.8MiB (520926557 bytes)
Index   Count   %      Size   % Cum     Max Kind
   0  189886   5 211153232  40  40    1112 Thread
   1  199117   6  72510520  13  5412583192 dict
   2  189892   5  65322848  12  66     344 _Condition
   3  380809  11  30464720   5  72      80 instancemethod
   4  397892  12  28673968   5  78    2080 tuple
   5  380694  11  27409968   5  83      72 builtin_function_or_method
   6  446606  13  26100905   5  88   14799 str
   7  189886   5  21267232   4  92     112 _socketobject
   8  197255   6  14568080   2  95   14688 list
...

At this point, you can see that there are 190k instances of Thread, which is consuming 40% of all memory. There is also a very large 12.5MB dict. (It turns out that this dict holds all of those Thread objects.)

But how do we determine that. One thing we can do is just get a handle to all of those Thread instances

>>> threads = om.get_all('Thread')
>>> threads[0]
Thread(32874448 1112B 23refs 3par)

So this thread is at address 32874448 (not particularly relevant), consumes 1112 bytes of memory (including its dict, since we collapsed threads), references 23 python objects, and is referenced by 3 python objects.

Lets see those references

>>> threads[0].c # shortcut for 'children'
[str(11409312 54B 189887par '_Thread__block'), _Condition(32903248 344B 11refs
 1par), str(11408976 53B 189887par '_Thread__name'), str(32862080 77B 1par 
'PoolThread-twisted.internet.reactor-1'), str(1...

It looks like there might be something interesting there, but it is a bit hard to sort out. Step one is to try using python's pprint utility.

>>> from pprint import pprint as pp
>>> pp(threads[0].c)
[str(11409312 54B 189887par '_Thread__block'),
 _Condition(32903248 344B 11refs 1par),
 str(11408976 53B 189887par '_Thread__name'),
 str(32862080 77B 1par 'PoolThread-twisted.internet.reactor-1'),
 str(11429168 57B 189887par '_Thread__daemonic'),
 bool(7478912 24B 572370par 'False'),
 str(11409200 56B 189887par '_Thread__started'),
 bool(7478944 24B 571496par 'True'),
...

That's a bit better, but I also now that instances have a dict, so lets try:

>>> pp(threads[0].refs_as_dict)
{'_Thread__args': tuple(140013759823952 56B 2008par),
 '_Thread__block': _Condition(32903248 344B 11refs 1par),
 '_Thread__daemonic': False,
 '_Thread__initialized': True,
 '_Thread__kwargs': dict(32516192 280B 1par),
 '_Thread__name': 'PoolThread-twisted.internet.reactor-1',
 '_Thread__started': True,
...

(Note to self, find a good way to shorten 'refs_as_dict', too much typing) Now that is starting to look like you can actually understand what is going on.

Another question to ask, who is referencing this object (why is it still active)?

>>> pp(threads[0].p)
[list(33599432 104B 1refs 1par),
 list(33649944 104B 1refs 1par),
 dict(11279168 1048B 10refs 1par)]

So this thread is in 2 lists and a dict with 10 items. So what about the parents of the parents

>>> pp(threads[0].p[0].p)
[ThreadPool(32888520 1120B 21refs 2par)]

So the first list is held by a ThreadPool. We can quick check info about that object:

>> pp(threads[0].p[0].p[0].refs_as_dict())
{'joined': False,
 'max': 10,
 'min': 0,
 'name': 'twisted.internet.reactor',
 'q': Queue(32888592 1120B 15refs 1par),
 'started': True,
 'threads': list(33599432 104B 1refs 1par),
 'waiters': list(33649944 104B 1refs 1par),
 'workers': 1,
 'working': list(33649656 72B 1par)}

So that seems to be a Twisted thread pool.
What about the other parents?

>>> pp(threads[0].p[1].p)
[ThreadPool(32888520 1120B 21refs 2par)]

Also a list held by a ThreadPool

>>> pp(threads[0].p[2].p)
[dict(11253824 3352B 98refs 70par)]

Hmmm, now we have a dict pointing to 98 objects which is, itself, referenced by 70 objects. This at least seems worth investigating further.

>>> d = threads[0].p[2].p[0]
>>> d
dict(11253824 3352B 98refs 70par)

Yep, that's the one. We can try to dump it as a dict

>>> pp(d.refs_as_dict())
>>> pp(d.refs_as_dict())
{'BoundedSemaphore': 'BoundedSemaphore',
 'Condition': 'Condition',
 'Event': 'Event',
 'Lock': builtin_function_or_method(10872592 72B 1refs 7par),
 'RLock': 'RLock',
 'Semaphore': 'Semaphore',
 'Thread': 'Thread',
...

Now, the formatting here is actually hiding something. Namely that the referred to object is actually a type:

>>> d.c[1]
type(11280288 880B 4refs 2par '_BoundedSemaphore')

From experience, I know that this is probably a module's dict. It has a lot of objects, and a lot of objects referencing it. (all functions references their modules global dict.) I'm still working on how to collapse a modules __dict__ into the module itself for clarity. Anyway, lets look at the parents to see what module this is.

>>> pp([p for p in d.p if p.type_str == 'module'])
[module(11411416 56B 1refs 18par 'threading')]

And there you go, the threading module.

And that's how you walk around the object graph. To finish analyzing this memory, I would probably poke at all the thread objects, and see what they are trying to accomplish. But mostly, the summary tells you that something is wrong. You shouldn't really be able to have 200k active threads doing real work. So probably you have something that is accidentally preserving threads that are no longer active.

Monday, November 16, 2009

Memory Debugging with Meliae

Background of Meliae 0.1.0

Earlier this year I started working on a new memory debugging program for python. I had originally tried to use heapy, but at the time it didn't support Windows, Mac, or 64-bit environments. (Which turned out to be all of my interesting platforms.) The other major problem is that I'm often debugging memory consumption of up to a GB of active data. While I think some of the former issues have been fixed, the latter is still a major issue for me.

So with the help of Michael Hudson, I started putting together a new structure. The code would be split into a scanner and a processor (loader). Such that you can interrupt a running process, dump the memory consumption to disk, and then analyze it in a separate process. (Often after the former has stopped.) The scanner can have a minimal memory profile, so even if your system is already swapping, you can dump out the memory info. (Robert Collins successfully dumped a 6GB memory profile, though analyzing that beast is still an issue.) The other advantage of this system, is that I don't have to play tricks with objects that represent the current state, like Guppy does with all sorts of crazy decorators.

In recent months, I've also focused on improving Bazaar's memory profile, which also meant improving memory profiling. Enough that I felt it was worth releasing the code. So officially Meliae 0.1.0 has been released. (For those wondering about the name, it is from Ash-Wood Nymph in Greek Mythology, aka it is just a fun name.)

Doing real work
So how does one actually use the program. Bazaar has a very nice ability, that you can use SIGQUIT (Ctrl+|) or SIGBREAK (Ctrl+Pause/Break) to drop into a debugger in the middle of a process to figure out what is going on. At that point, you can just:

from meliae import scanner
scanner.dump_all_objects('filename.json')

(There is an alternative scanner.dump_gc_objects() which has even lower memory profile, but will dump some objects more than once, creating a larger dump file.)

This creates a file describing all of the Python objects it was able to find along with their known size, references, and for some objects (strings, ints) their content. From there, you start another shell, and use:

>>> from meliae import loader
>>> om = loader.load('filename.json')
>>> s = om.summarize(); s

This dumps out something like:
Total 17916 objects, 96 types, Total size = 1.5MiB (1539583 bytes)
Index   Count   %      Size   % Cum     Max Kind
   0     701   3    546460  35  35   49292 dict
   1    7138  39    414639  26  62    4858 str
   2     208   1     94016   6  68     452 type
   3    1371   7     93228   6  74      68 code
   4    1431   7     85860   5  80      60 function
   5    1448   8     59808   3  84     280 tuple
   6     552   3     40760   2  86     684 list
   7      56   0     29152   1  88     596 StgDict
   8    2167  12     26004   1  90      12 int
   9     619   3     24760   1  91      40 wrapper_descriptor
  10     570   3     20520   1  93      36 builtin_function_or_method
  ...

Showing the top objects and what data they consume. This can often be revealing it itself. Do you have millions of tuples? One giant dict that is consuming a surprising amount of memory? (A dict with 200k entries is ~6MB on a 32-bit platform.)

There is more that can be done. You can run:

om.compute_referrers()

At this point, you can look at a single node, and find out what was referencing it. (So what was referencing that largest dict?)

>>> om[s.summaries[0].max_address]
MemObject(29351984, dict, 49292 bytes, 1578 refs [...], 1 referrers [26683840])

>>> om[26683840]
MemObject(29337264, function, format_string, 60 bytes, 6 refs...)

However, it also turns out that all 'classic' classes in Python indirect to their data via self.__dict__, which is a bit annoying to walk through. It also makes it looks like 'dict' is the #1 memory consumer, when actually it might be instances of Foo, which happen to use dicts. So you can use
om.collapse_instance_dicts()

Which will find all instances that seem to have trivial references to a __dict__, and then collapse it so that all references are directly from the instance, and all referenced objects then claim the instance as the referrer.

The above dump changes to:

>>> s = om.summarize(); s
Total 17701 objects, 96 types, Total size = 1.5MiB (1539583 bytes)
Index   Count   %      Size   % Cum     Max Kind
    0    7138  40    414639  26  26    4858 str
    1     486   2    394632  25  52   49292 dict
    2     208   1     94016   6  58     452 type
    3    1371   7     93228   6  64      68 code
    4    1431   8     85860   5  70      60 function
    5     149   0     82844   5  75     556 ReadLineTextBuffer
    6      93   0     65384   4  79    6312 module
    7    1448   8     59808   3  83     280 tuple
    8     552   3     40760   2  86     684 list
    9      56   0     29152   1  88     596 StgDict
   10    2167  12     26004   1  90      12 int

Which shows that ReadLineTextBuffer is actually a large consumer of memory.

Misc
There are other bits to explore, and improvements to be made. "scanner.get_recursive_size()" can be useful if you don't want to dump out a big file to analyze memory referenced from a given object (such as a cache). It doesn't give the whole picture, but can be useful in an interactive session.

Results
In the end, this code has enabled us to cut the memory consumption of Bazaar
roughly in half (for bzr branch). It also lets you see nice summaries
like this:

Total 2805995 objects, 276 types, Total size = 946.0MiB (991983819 bytes)
Index   Count   %      Size   % Cum     Max Kind
    0 1939090  69 916011611  92  92 5762600 str
    1    9449   0  33069868   3  95 3145868 dict
    2  132202   4  12506732   1  96     536 unicode
    3  383436  13   7048652   0  97      20 bzrlib._static_tuple_c.StaticTuple
    4  160027   5   5873744   0  98     304 tuple
    5    5429   0   5185252   0  98  412236 list
    6   62256   2   4482432   0  99      72 InventoryFile
    7     148   0   1334032   0  99 1048692 set
    8    2185   0   1214860   0  99     556 GroupCompressBlock
    9    8003   0    992372   0  99     124 CHKInventoryDirectory
...

(Note that after seeing this, we changed the code to not cache as many strings in memory, and I managed to decrease memory consumption to about 1/3rd it once was for this operation.)

The code isn't perfect, but being able to get a view of where memory is going, and what objects are holding on to it, is a huge improvement over just being in the dark.

Thursday, October 15, 2009

The Joys of multiple releases

I had originally written a longer post over at wordpress, only to have Firefox crash while trying to move an image, and WP doesn't do auto-saving like blogger. So now I'm back...

Bazaar 2.0.1 and 2.1.0b1 have now 'gone gold' in that I've uploaded the official tarballs, and asked people to make installers for them. Once installers are made, then we'll make the official announcement.

For those who haven't been following, Bazaar has now split its releases into 2 series. The 2.0.x series is based on 2.0.0 and has only bugfixes. Things that could cause compatibility problems (new features, removal of deprecated code, etc.) is only done in the 2.1.0.x series. We're hoping that this can give people some flexibility, as well as giving us more flexibility. In the past, we've suffered a bit trying to maintain backwards compatibility for some features/bugfixes, only to break compatibility for a big feature. Instead of suffering the worst of both, we're trying to get the best of both. If something needs to break compatibility, it just goes in the dev branch. Note that the development branch is still considered 'stable', in that the test suite always passes, and the code is pretty much always ready for a release. We just don't make the same guarantees about stable internal apis for 3rd parties to use.

The other change to the process is to stop doing as many "release candidate" builds. Instead, we will just cut a release. If there are problems, we'll cut the next release sooner. The chance for regressions in the 'bugfix-only' 2.0.x series should be low, and getting away from pre-builds means less overhead. We will still be doing releases we call 'rc1' before the next major stable release (2.1.0), and in that vein we expect to do little-to-no changes from the rc1 to the final build.

However, this new system does increase overhead for a single release. As now it is equivalent to doing the rc and the final in the same day. Also, because we now have 2 "integration" branches, it requires a bit more coordination between them.

For example, this is the revision graph for the recent 2.0.1 and 2.1.0b1 release

The basic workflow that I used was something like

Have a LOSA create 2 release branches lp:~bzr-pqm/bzr/2.0.1 and lp:~bzr-pqm/bzr/2.1.0b1
Create a local branch of each
Create another branch for doing my updates in, such as lp:~jameinel/bzr/2.0.1
Update 2.0.1 with a new version string
Update NEWS to clean it up, show that there is an official release, and provide a summary/overview of the changes.
Land this update into the official 2.0.1 branch via PQM. (Unfortunately this can take up to 2 hours depending on a bunch of different factors. We are trying to get this down to more like 10 min.)
Update my local copy from the final release. Tag it (bzr-2.0.1).
Create the tarball
Create the release launchpad
Upload the tarball to the release
While this is going on, go through the bugtracker and make sure that things mentioned in NEWS have the appropriate "Fix Released" state in the bug tracker, as well as being associated with the right milestones. With 34 bugfixes, this is a non-trivial undertaking.
Merge the 2.0.1 final release into the 2.1.0b1 branch. (All bugfixes in the stable series are candidates for merging at any time into the development series.)
Do lots of cleanup in NEWS. The main difficulty here is that bugfixes are present on 2 integration branches simultaneously, and those releases are slightly independent. We've talked about having the bugfix mentioned in both sections. Which would be more important if we ever make a development release without doing the corresponding stable release.
Do steps 4-10 again for 2.1.0b1.
While working or waiting on that, prepare lp:~bzr-pqm/bzr/2.0 since it is now going to be prepped for 2.0.2. This involves, bumping the version number, updating NEWS with blank entries for the next release (avoids some conflicts for people landing changes in that branch), and submitting all of that back to PQM.
When that has finished, bring the 2.0 stable branch back into bzr.dev. And prepare bzr.dev for 2.1.0b2. (version number bumps, NEWS cleanups, etc.)
In this case, cleaning up NEWS was again a bit of a chore. As now you have a file that should have a blank area for both the 2.1.0b2 changes, but also the 2.0.2 changes. Further, some of the changes that landed in bzr.dev in the mean-time, were not included in the 2.1.0b1 release. So you have to move them up into the new section. Getting NEWS right across 4 branches was quite a bit of work, and probably the hardest part (so far) of the process. Copy & Paste + bzr diff + bzr vimdiff we quite helpful here. Setting the news in bzr.dev to the exact copy from 'bzr-2.1.0b1' and then showing what was removed/added was a nice way to make sure to get everything .
breathe
Announce the tarballs, etc on the bzr mailing list, so that people can start preparing packages/installers.
I'm also the windows installer packager, so I get to build around 8 installers... (standalone installers, and 3 python [2.4, 2.5, 2.6] installers.) Most of this is scripted, but often something breaks along the way.
Update Pypi, freshmeat, ... with updates for the new versions. Twice. (At least here we did not update for 'rc' versions. So this work is strictly doubled.)

Overall, it is a fair amount of work. I think it will amount to 1-2 days of work time (spread out over 3+ days of real time). With any luck, it will amount to being 'more concentrated, but less often'. I would say that we'd get more practice, but we also try to rotate release managers. Both to spread the knowledge, and to avoid burnout. (Though Martin has been doing the last 4-or-so releases...)

So here's to everyone upgrading to their preferred release (in about a week's time).

Thursday, October 8, 2009

Refactoring work for review (and keep your annotations)

Tim Penhey recently had a nice post about how he split up his changes to make it easier to review. His method used 'bzr pipeline' and some combinations of shelving, merging, and reverting the merges.

However, while I wanted to refactor my changes to make it easier to review, I didn't want to lose my annotation history. So I took a different approach.

To start, with, I'll assume you have a single branch with lots of changes, each wrt different features. You developed them 'concurrently' (jumping back and forth between features, without actually committing it to a different branch). And now that you are done, you want to split them out again.

There are a lot of possible ways that you can do this. With some proponents prefering a 'rebase' style. Where you replay the commits you made in a new order, possibly squashing them, etc. I'm personally not a big fan of that.
Tim's is another method, where you just cherrypick the changes into new branches, and use something like bzr-pipeline to manage the layering. However in reading his workflow, he would also lose the history of the individual changes.

So this is my workflow.

Start with a branch that has a whole lot of changes on it, and is essentially 'done'. We'll call this branch "dogpile".
Create a new branch from it (bzr branch --switch ../dogpile ../feature1), and remove all of the changes but the 'first step'. I personally did that with "bzr revert -r submit: file1 file2 file3" but left "file4" alone.
"bzr commit" in that branch. The delta for that revision will show a lot of your hard-worked on changes being removed. However "bzr diff -r submit:" should show a very nice clean patch that only includes the changes for "feature1".
Go back to the original dogpile branch, and create a new "feature2" branch. (bzr branch --switch ../dogpile ../feature2)
Now merge the "feature1" branch (bzr merge ../feature1). At this point, it looks like everything has been removed except for the bits for feature1. However, just using "bzr revert file2..." we can restore the changes for "feature2".
You can track your progress in a few ways. "bzr diff -r submit:" will show you the combine differences from feature1 and feature2. "bzr diff -r -1:../feature1" will show you just the differences between the current feature2 branch and the feature1 branch. The latter is what you want to be cleaning up, so that it includes all of your feature2 changes, built on top of your feature1 changes. You also have the opportunity to tweak the code a bit, and run the test suite to make sure things are working correctly.
"bzr commit" in that branch. At this point, the diff from upstream to feature1 should be clean, and the diff from feature1 => feature2 should be clean. As an added benefit, doing "bzr annotate file2" will preserve all the hard-won history of the file.
repeat steps 4-7 for all the other features you wanted to split out into their own branches.

When you are done, you will have N feature branches, split up from the original "dogpile" branch. By using the "merge + revert things back into existence" trick, you can preserve all of the annotations for your files. This works because you have 2 sources that the file content could come from. One source is the "dogpile" branch, and the other source is a branch where "dogpile" changes were removed. Since the changes are present in one of the parents, the annotations are brought from there.

This is what the 'qlog' of my refactoring looks like.

The actual content changes (the little grey dots) actually span about 83 commits. However, you can see that I split that up into 6 new branches (some more independent than others), all of which generate a neat difference to their parent, and preserve all of the annotation information from the full history. You can also see that now that I have it split out, I can do simple changes to each branch (notice that purple has an extra commit). This will most likely come into play if people ask for any changes during review.