John Arbash Meinel's Bazaar Blog: 2010

Wednesday, August 4, 2010

Step-by-step Meliae

Some people asked me to provide a step-by-step guide to how to debug memory using Meliae. I just ran into another unknown situation, so I figured I'd post a step-by-step along with rationale for why I'm doing it.

First is loading up the data. This was a dump while running 'bzr pack' of a large repository.
```
>>> from meliae import loader
>>> om = loader.load('big.dump')
>>> om.remove_expensive_references()
```
The last step is done because otherwise instances keep a reference to their class, and classes reference their base types, and you end up getting to 'object', and somewhere along the way you end up referencing too much. I don't do it automatically, because it does remove actual references, which someone might want to keep.

Then, do a big summary, just to get started

>>> om.summarize()
Total 8364538 objects, 286 types, Total size = 440.4MiB (461765737 bytes)
Index   Count   %      Size   % Cum     Max Kind
0 2193778  26 181553569  39  39 4194281 str
1   12519   0  97231956  21  6012583052 dict
2 1599439  19  68293428  14  75     304 tuple
3 3459765  41  62169616  13  88      20 bzrlib._static_tuple_c.StaticTuple
4      82   0  29372712   6  94 8388724 set
5 1052573  12  12630876   2  97      12 int
6    1644   0   4693700   1  98 2351848 list
7    4038   0   2245128   0  99     556 _LazyGroupCompressFactory

You can see that
1. There are 8M objects, and about 440MB of reachable memory.
2. The vast bulk of that is in strings, but there are also some oddities, like that 12.5MB dictionary
At this point, I wanted to understand what was up with that big dictionary.
```
>>> dicts = om.get_all('dict')
>>> dicts[0]
dict(417338688 12583052B 1045240refs 2par)
```
om.get_all() gives you a list of all objects matching the given type string. It also sorts the returned list, so that the biggest items are at the
beginning.

Now lets look around a bit, to try to figure out where this dict lives

>>> bigd = dicts[0]
>>> from pprint import pprint as pp
We'll use pprint a lot, so map it to something easy to type.
>>> pp(bigd.p)
[frame(39600120 464B 23refs 1par '_get_remaining_record_stream'),
_BatchingBlockFetcher(180042960 556B 17refs 3par)]

So this dict is contained in a frame, but also an attribute of _BatchingBlockFetcher. Let's try to see which attribute it is.

>>> pp(bigd.p[1].refs_as_dict())
{'batch_memos': dict(584888016 140B 4refs 1par),
'gcvf': GroupCompressVersionedFiles(571002736 556B 13refs 9par),
'keys': list(186984208 16968B 4038refs 2par),
'last_read_memo': tuple(536280880 40B 3refs 1par),
'locations': dict(417338688 12583052B 1045240refs 2par),
'manager': _LazyGroupContentManager(584077552 172B 7refs 3716par),
'memos_to_get': list(186983248 52B 1refs 2par),
'total_bytes': 774119}

It takes a bit to look through that, but you can see:
```
'locations': dict(417338688 12583052B 1045240refs 2par)
```
Note that 1045240refs means there are 522k key:value pairs in this dict.

How much total memory is this dict referencing?

>>> om.summarize(bigd)
Total 4035636 objects, 22 types, Total size = 136.8MiB (143461221 bytes)
Index   Count   %      Size   % Cum     Max Kind
0 1567864  38  66895512  46  46      52 tuple
1  285704   7  24972909  17  64     226 str
2 1142424  28  20757800  14  78      20 bzrlib._static_tuple_c.StaticTuple
...
8       2   0      1832   0  99    1684 FIFOCache
9      35   0      1120   0  99      32 _InternalNode

So about 136MB out of 440MB is reachable from this dict. However, I'm noticing that FIFOCache and _InternalNode is also reachable, and those don't really seem to fit. I also notice that there are 1.6M tuples here, which is often a no-no. (If we are going to have that many tuples, we probably want them to be StaticTuple() because they use a fair amount less memory, can be interned, and aren't in the garbage collector. So lets poke around a little bit
```
>>> bigd[0]
bzrlib._static_tuple_c.StaticTuple(408433296 20B 2refs 9par)
>>> bigd[1]
tuple(618390272 44B 4refs 1par)
>>> pp(bigd[0].c)
[str(40127328 80B 473par 'svn-v4:138bc75d-0d04-0410-961f-82ee72b054a4:trunk:126948'),
str(247098672 85B 37par '14@138bc75d-0d04-0410-961f-82ee72b054a4:trunk%2Fgcc%2Finput.h')]
>>> pp(bigd[1].c)
[tuple(618383880 36B 2refs 1par),
bzrlib._static_tuple_c.StaticTuple(569848240 16B 1refs 3par),
NoneType(505223636 8B 1074389par),
tuple(618390416 48B 5refs 1par)]
```
One thing to note, dict references are [key1, value1, key2, value2] while tuple references are (last, middle, first). I don't know why tuple.tp_traverse traverses in reverse order, but it does. And StaticTuple followed its lead.
The things to take away from this is
1. It is mapping a StaticTuple(file_id, revision_id) => tuple()
2. The target tuple is actually quite complex, so we'll have to dig a bit deeper to figure it out.
3. The file-id and revision-id are both referenced many times (37 and 473 respectively), so we seem to be doing a decent job sharing those strings.
At this point, I would probably pull up the source code for _BatchingBlockFetcher, to try and figure out what is so big for locations. Looking at the source code, it is actually built in _get_remaining_record_stream as:
```
locations = self._index.get_build_details(keys)
```
This is then defined as returning:
```
  :return: A dict of key: (index_memo, compression_parent, parents, record_details).
```

And the index memo contains a reference to the indexes themselves, but they don't really 'own' them. So lets filter them out:

>>> indexes = om.get_all('BTreeGraphIndex')
>>> om.summarize(bigd, excluding=[o.address for o in indexes])
Total 3740667 objects, 6 types, Total size = 122.9MiB (128855911 bytes)
Index   Count   %      Size   % Cum     Max Kind
  0 1567860  41  66895360  51  51      48 tuple
  1  189162   5  19690647  15  67     226 str
  2  948160  25  17261048  13  80      20 bzrlib._static_tuple_c.StaticTuple
  3       1   0  12583052   9  9012583052 dict
  4 1035483  27  12425796   9  99      12 int
  5       1   0         8   0 100       8 NoneType

(It is currently a bit clumsy that you have to do [o.address], but it means you can use large sets of ints. I'm still trying to sort that out.)
The memory consumption here looks more realistic. You can also see that just the tuple objects by themselves consume 67MB, or 51% of the memory. You can also see that for a dict holding 500k entries, we have 1.5M tuples. So we are using 3 tuples per key.

Note that we can't just use StaticTuple here, because index_memo[0] is the BTreeGraphIndex. Digging into the code, I think the data is all here:
```
 result[key] = (self._node_to_position(entry),
               None, parents, (method, None))
```
You can see that there is a whole lot of 'None' in this, and we also have an extra tuple at the end which is a bit of a waste (vs just inlining the content). We could save 28 bytes/record (or 28*500k = 14MB) by just inlining that last (method, None). Though it changes some apis.
Another thing to notice is that if you grep through the source code for uses of 'locations', you can see that we use the parents info and the index_memo, but we just ignore everything else. (method, compression_parent, and eol info are never interesting here). So really the result could be:
result[key] = (self._node_to_position(entry), parents)
This would be 28 + 4*2 = 36 vs (28+4*4 + 28+4*2) = 80, or saving 44b/record*.5M = 22MB. That is about 20% of that 122MB. Which isn't huge, but isn't a lot of effort to get. We could get a little better if we could collapse the node_to_position info along side the parents info, etc. (Say with a custom object.) That could shave another 28 bytes for the tuple, and maybe one extra reference.

I ended up working on this, because it was like a 10 minute thing. I ended up creating this class (code at lp:

class _GCBuildDetails(object):
   """A blob of data about the build details.

   This stores the minimal data, which then allows compatibility with the old
   api, without taking as much memory.
   """

   __slots__ = ('_index', '_group_start', '_group_end', '_basis_end',
                '_delta_end', '_parents')

   method = 'group'
   compression_parent = None

   def __init__(self, parents, position_info):
       self._parents = parents
       self._index = position_info[0]
       self._group_start = position_info[1]
       # Is this _end or length? Doesn't really matter to us
       self._group_end = position_info[2]
       self._basis_end = position_info[3]
       self._delta_end = position_info[4]

   def __repr__(self):
       return '%s(%s, %s)' % (self.__class__.__name__,
           self.index_memo, self._parents)

   @property
   def index_memo(self):
       return (self._index, self._group_start, self._group_end,
               self._basis_end, self._delta_end)

   @property
   def record_details(self):
       return static_tuple.StaticTuple(self.method, None)

   def __getitem__(self, offset):
       """Compatibility thunk to act like a tuple."""
       if offset == 0:
           return self.index_memo
       elif offset == 1:
           return self.compression_parent # Always None
       elif offset == 2:
           return self._parents
       elif offset == 3:
           return self.record_details
       else:
           raise IndexError('offset out of range')
         
   def __len__(self):
       return 4

The size of this class is 48 bytes, including the python object and gc overhead. This replaces the tuple(index_memo_tuple(index, start, end, start, end), None, parents, tuple(method, None)). Which is 28+4*4 + 28+4*5 + 28+4*2 = 128 bytes. So we save 80 bytes per record. on my bzr.dev repository that is ~10.6MB, on this dump it would be 40MB.

The other bit to look at is measuring real-world results. Which looks
something like this:

>>> from bzrlib import branch, trace, initialize; initialize().__enter__()

>>> b = branch.Branch.open('.')
>>> b.lock_read()
LogicalLockResult(/2.3-gc-build-details/)>)
>>> keys = b.repository.texts.keys()
>>> trace.debug_memory('holding all keys')
WorkingSize   33192KiB  PeakWorking   34772KiB  holding all keys
>>> locations = b.repository.texts._index.get_build_details(keys)
>>> trace.debug_memory('holding all keys')
WorkingSize   77604KiB  PeakWorking   87960KiB  holding all keys
>>>

Hopefully this has been informative. Digging into a bit of memory consumption, and how to determine where memory is being consumed, and a bit of understanding about how you can rework python objects to save a bit of memory (the biggest thing is to try to use fewer objects overall, since every object is at least 24 bytes, and that is if you are using __slots__. If you aren't then it is a minimum of 172 bytes (32 for the base object + 140 for its __dict__).

Monday, August 2, 2010

Meliae 0.3.0, statistics on subsets

Ah, yet another release. Hopefully with genuinely useful functionality.

In the process of inspecting yet another unexpected memory consumption, I came across a potential solution to the reference cycles problem.

Specifically, the issue is that often (at least in our codebases) you have coupled classes, that end up in a cycle, and you have trouble determining who "owns" what memory. In our case, the objects tend to be only 'loosely' coupled. In that one class passes off reference to a bound method to another object. However, a bound method holds a reference to the original object, so you get a cycle. (For example Repository passes its 'is_locked()' function down to the VersionedFiles so that they know whether it is safe to cache information. Repository "owns" the VersionedFiles, but they end up holding a reference back.)

What turned out to be useful was just adding an exclusion list to most operations. This ends up letting you find out about stuff that is referenced by object1, but is not referenced inside a specific subset.

One of the more interesting apis is the existing ObjManager.summarize().

So you can now do stuff like:

>>> om = loader.load('my.dump')
>>> om.summarize()
>>> om.summarize()
Total 5078730 objects, 290 types, Total size = 367.4MiB (385233882 bytes)
Index   Count   %      Size   % Cum     Max Kind
    0 2375950  46 224148214  58  58 4194313 str
    1   63209   1  77855404  20  78 3145868 dict
    2 1647097  32  29645488   7  86      20 bzrlib._static_tuple_c.StaticTuple
    3  374259   7  14852532   3  89     304 tuple
    4  138464   2  12387988   3  93     536 unicode
 ...

You can see that there is a lot of strings and dicts referenced here, but who owns them. Tracking into the references and using om.compute_total_size() just seems to get a lot of objects that reference everything. For example:

>>> dirstate = om.get_all('DirState')[0]
>>> om.summarize(dirstate)
Total 5025919 objects, 242 types, Total size = 362.0MiB (379541089 bytes)
Index   Count   %      Size   % Cum     Max Kind
    0 2355265  46 223321197  58  58 4194313 str
...

Now that did filter out a couple of objects, but when you track the graph, it turns out that DirState refers back to its WorkingTree, and WT has a Branch, which has the Repository, which has all the actual content. So what is actually referred to by just DirState?

>>> from pprint import pprint as pp
>>> pp(dirstate.refs_as_dict())
{'_bisect_page_size': 4096,
...
'_sha1_file': instancemethod(34050336 40B 3refs 1par),
'_sha1_provider': ContentFilterAwareSHA1Provider(41157008 172B 3refs 2par),
...
'crc_expected': -1471338016}
>>> pp(om[41157008].c)
[str(30677664 28B 265par 'tree'),
WorkingTree6(41157168 556B 35refs 7par),
type(39222976 452B 4refs 4par 'ContentFilterAwareSHA1Provider')]
>>> wt = om[41157168]
>>> om.summarize(dirstate, excluding=[wt.address])
Total 5025896 objects, 238 types, Total size = 362.0MiB (379539040 bytes)

Oops, I forgot an important step. Instances refer back to their type, and new-style classes keep an MRU reference all the way back to object which ends up referring to the whole dataset.

>>> om.remove_expensive_references()
removed 1906 expensive refs from 5078730 objs

Note that it doesn't take many references (just 2k out of 5M objects) to cause these problems.

>>> om.summarize(dirstate, excluding=[wt.address])
Total 699709 objects, 19 types, Total size = 42.2MiB (44239684 bytes)
Index   Count   %      Size   % Cum     Max Kind
    0  285690  40  20997620  47  47     226 str
    1  212977  30   8781420  19  67      48 tuple
    2   69640   9   8078240  18  85     116 set
...

And there you see that we have only 42MB that is directly referenced from DirState. (still more than I would like, but at least it is useful data, rather than just saying it references all objects).

I'm not 100% satisfied with the interface. Right now it takes an iterable of integer addresses. Which is often good because those integers are small and shared, so the only cost is the actual list. Taking objects requires creating the python proxy objects, which is something I'm avoiding because it actually requires a lot of memory to do so. (Analyzing 10M objects takes 1.1GB of peak ram, 780MB sustained.)

Tuesday, July 20, 2010

Meliae 0.2.1

Meliae 0.2.1 is now officially released.

The list of changes isn't great, it is mostly a bugfix release. There are a couple of quality-of-life changes.

For example you used to need to do:

>>> om = loader.load(filename)
>>> om.compute_parents()
>>> om.collapse_instance_dicts()

However, that is now done as part of loader.load(). This also goes along with a small bug fix to scanner.dump_all_objects() that makes sure to avoid dumping the gc.get_objects() list, since that is artifact from scanning, and not actually something you care about in the dump file.

Many thanks to Canonical for bringing me to Prague for the Launchpad Epic, giving me some time to work on stuff that isn't just Bazaar.

Saturday, January 9, 2010

Meliae 0.2.0

And here we are, with a new release of Meliae 0.2.0. This is a fairly major reworking of the internals, though it should be mostly compatible with 0.1.2. (The disk format did not change, most of the apis have deprecated thunks to help you migrate.)

The main difference is how data is stored in memory. Instead of using a Python dict + python objects, I know use a custom data collection. Python's generic objects are great for getting stuff going, but I was able to cut memory consumption in half with a custom object. This means that finally, analyzing a 600MB dump takes less than 600MB of memory (currently takes about ~300MB). Of course that also depends on your data structures (600MB dump that is one 500MB string will take up very little memory for analysis.)

The second biggest feature is hopefully a cleaner interface.

Call references 'parents' or 'children'. Indicating objects which point to me, and objects which I point to, respectively. 'ref_list' and 'referrers' was confusing. Both start with 'ref', so it takes a bit to sort them out.
Add attributes to get direct access to parents and children, rather than having to go back through the ObjManager.
Change the formatting strings to be more compact. No longer show the refs by default, since you can get to the objects anyway.

A third minor improvement is support for collapsing old-style classes (ones that don't inherit from 'object'.)

So how about an example. To start with, you need a way to interrupt your running process and get a dump of memory. I can't really give you much help, but you'll end up wanting:

from meliae import scanner
scanner.dump_all_objects('test-file.dump')

(This is the simplest method. There are others that take less memory while dumping, if overhead is a concern.)

Once you have that dump file, start up another python process and let's analyze it.

$ python
>>> from meliae import loader
>>> om = loader.load('test-file.dump')
loaded line 3579013, 3579014 objs, 377.4 / 377.4 MiB read in 79.6s

I recommend just always running these lines. If you used a different method of dumping, there are other things to do, which is why it isn't automatic (yet).

>>> om.compute_parents(); om.collapse_instance_dicts()
set parents  3579013 /  3579014
checked  3579013 /  3579014 collapsed   383480
set parents  3195533 /  3195534

Now we can look at the data, and get a feel for where our memory has gone:

>>> s  = om.summarize(); s
Total 3195534 objects, 418 types, Total size = 496.8MiB (520926557 bytes)
Index   Count   %      Size   % Cum     Max Kind
   0  189886   5 211153232  40  40    1112 Thread
   1  199117   6  72510520  13  5412583192 dict
   2  189892   5  65322848  12  66     344 _Condition
   3  380809  11  30464720   5  72      80 instancemethod
   4  397892  12  28673968   5  78    2080 tuple
   5  380694  11  27409968   5  83      72 builtin_function_or_method
   6  446606  13  26100905   5  88   14799 str
   7  189886   5  21267232   4  92     112 _socketobject
   8  197255   6  14568080   2  95   14688 list
...

At this point, you can see that there are 190k instances of Thread, which is consuming 40% of all memory. There is also a very large 12.5MB dict. (It turns out that this dict holds all of those Thread objects.)

But how do we determine that. One thing we can do is just get a handle to all of those Thread instances

>>> threads = om.get_all('Thread')
>>> threads[0]
Thread(32874448 1112B 23refs 3par)

So this thread is at address 32874448 (not particularly relevant), consumes 1112 bytes of memory (including its dict, since we collapsed threads), references 23 python objects, and is referenced by 3 python objects.

Lets see those references

>>> threads[0].c # shortcut for 'children'
[str(11409312 54B 189887par '_Thread__block'), _Condition(32903248 344B 11refs
 1par), str(11408976 53B 189887par '_Thread__name'), str(32862080 77B 1par 
'PoolThread-twisted.internet.reactor-1'), str(1...

It looks like there might be something interesting there, but it is a bit hard to sort out. Step one is to try using python's pprint utility.

>>> from pprint import pprint as pp
>>> pp(threads[0].c)
[str(11409312 54B 189887par '_Thread__block'),
 _Condition(32903248 344B 11refs 1par),
 str(11408976 53B 189887par '_Thread__name'),
 str(32862080 77B 1par 'PoolThread-twisted.internet.reactor-1'),
 str(11429168 57B 189887par '_Thread__daemonic'),
 bool(7478912 24B 572370par 'False'),
 str(11409200 56B 189887par '_Thread__started'),
 bool(7478944 24B 571496par 'True'),
...

That's a bit better, but I also now that instances have a dict, so lets try:

>>> pp(threads[0].refs_as_dict)
{'_Thread__args': tuple(140013759823952 56B 2008par),
 '_Thread__block': _Condition(32903248 344B 11refs 1par),
 '_Thread__daemonic': False,
 '_Thread__initialized': True,
 '_Thread__kwargs': dict(32516192 280B 1par),
 '_Thread__name': 'PoolThread-twisted.internet.reactor-1',
 '_Thread__started': True,
...

(Note to self, find a good way to shorten 'refs_as_dict', too much typing) Now that is starting to look like you can actually understand what is going on.

Another question to ask, who is referencing this object (why is it still active)?

>>> pp(threads[0].p)
[list(33599432 104B 1refs 1par),
 list(33649944 104B 1refs 1par),
 dict(11279168 1048B 10refs 1par)]

So this thread is in 2 lists and a dict with 10 items. So what about the parents of the parents

>>> pp(threads[0].p[0].p)
[ThreadPool(32888520 1120B 21refs 2par)]

So the first list is held by a ThreadPool. We can quick check info about that object:

>> pp(threads[0].p[0].p[0].refs_as_dict())
{'joined': False,
 'max': 10,
 'min': 0,
 'name': 'twisted.internet.reactor',
 'q': Queue(32888592 1120B 15refs 1par),
 'started': True,
 'threads': list(33599432 104B 1refs 1par),
 'waiters': list(33649944 104B 1refs 1par),
 'workers': 1,
 'working': list(33649656 72B 1par)}

So that seems to be a Twisted thread pool.
What about the other parents?

>>> pp(threads[0].p[1].p)
[ThreadPool(32888520 1120B 21refs 2par)]

Also a list held by a ThreadPool

>>> pp(threads[0].p[2].p)
[dict(11253824 3352B 98refs 70par)]

Hmmm, now we have a dict pointing to 98 objects which is, itself, referenced by 70 objects. This at least seems worth investigating further.

>>> d = threads[0].p[2].p[0]
>>> d
dict(11253824 3352B 98refs 70par)

Yep, that's the one. We can try to dump it as a dict

>>> pp(d.refs_as_dict())
>>> pp(d.refs_as_dict())
{'BoundedSemaphore': 'BoundedSemaphore',
 'Condition': 'Condition',
 'Event': 'Event',
 'Lock': builtin_function_or_method(10872592 72B 1refs 7par),
 'RLock': 'RLock',
 'Semaphore': 'Semaphore',
 'Thread': 'Thread',
...

Now, the formatting here is actually hiding something. Namely that the referred to object is actually a type:

>>> d.c[1]
type(11280288 880B 4refs 2par '_BoundedSemaphore')

From experience, I know that this is probably a module's dict. It has a lot of objects, and a lot of objects referencing it. (all functions references their modules global dict.) I'm still working on how to collapse a modules __dict__ into the module itself for clarity. Anyway, lets look at the parents to see what module this is.

>>> pp([p for p in d.p if p.type_str == 'module'])
[module(11411416 56B 1refs 18par 'threading')]

And there you go, the threading module.

And that's how you walk around the object graph. To finish analyzing this memory, I would probably poke at all the thread objects, and see what they are trying to accomplish. But mostly, the summary tells you that something is wrong. You shouldn't really be able to have 200k active threads doing real work. So probably you have something that is accidentally preserving threads that are no longer active.

John Arbash Meinel's Bazaar Blog

Blog Archive

Contributors