Wednesday, August 4, 2010

Step-by-step Meliae

Some people asked me to provide a step-by-step guide to how to debug memory using Meliae. I just ran into another unknown situation, so I figured I'd post a step-by-step along with rationale for why I'm doing it.
  1. First is loading up the data. This was a dump while running 'bzr pack' of a large repository.
    >>> from meliae import loader
    >>> om = loader.load('big.dump')
    >>> om.remove_expensive_references()
    The last step is done because otherwise instances keep a reference to their class, and classes reference their base types, and you end up getting to 'object', and somewhere along the way you end up referencing too much. I don't do it automatically, because it does remove actual references, which someone might want to keep.
  2. Then, do a big summary, just to get started
    >>> om.summarize()
    Total 8364538 objects, 286 types, Total size = 440.4MiB (461765737 bytes)
    Index Count % Size % Cum Max Kind
    0 2193778 26 181553569 39 39 4194281 str
    1 12519 0 97231956 21 6012583052 dict
    2 1599439 19 68293428 14 75 304 tuple
    3 3459765 41 62169616 13 88 20 bzrlib._static_tuple_c.StaticTuple
    4 82 0 29372712 6 94 8388724 set
    5 1052573 12 12630876 2 97 12 int
    6 1644 0 4693700 1 98 2351848 list
    7 4038 0 2245128 0 99 556 _LazyGroupCompressFactory
  3. You can see that
    1. There are 8M objects, and about 440MB of reachable memory.
    2. The vast bulk of that is in strings, but there are also some oddities, like that 12.5MB dictionary
  4. At this point, I wanted to understand what was up with that big dictionary.
    >>> dicts = om.get_all('dict')
    >>> dicts[0]
    dict(417338688 12583052B 1045240refs 2par)
    om.get_all() gives you a list of all objects matching the given type string. It also sorts the returned list, so that the biggest items are at the
  5. Now lets look around a bit, to try to figure out where this dict lives
    >>> bigd = dicts[0]
    >>> from pprint import pprint as pp
    We'll use pprint a lot, so map it to something easy to type.
    >>> pp(bigd.p)
    [frame(39600120 464B 23refs 1par '_get_remaining_record_stream'),
    _BatchingBlockFetcher(180042960 556B 17refs 3par)]
  6. So this dict is contained in a frame, but also an attribute of _BatchingBlockFetcher. Let's try to see which attribute it is.
    >>> pp(bigd.p[1].refs_as_dict())
    {'batch_memos': dict(584888016 140B 4refs 1par),
    'gcvf': GroupCompressVersionedFiles(571002736 556B 13refs 9par),
    'keys': list(186984208 16968B 4038refs 2par),
    'last_read_memo': tuple(536280880 40B 3refs 1par),
    'locations': dict(417338688 12583052B 1045240refs 2par),
    'manager': _LazyGroupContentManager(584077552 172B 7refs 3716par),
    'memos_to_get': list(186983248 52B 1refs 2par),
    'total_bytes': 774119}
  7. It takes a bit to look through that, but you can see:
    'locations': dict(417338688 12583052B 1045240refs 2par)
    Note that 1045240refs means there are 522k key:value pairs in this dict.
  8. How much total memory is this dict referencing?
    >>> om.summarize(bigd)
    Total 4035636 objects, 22 types, Total size = 136.8MiB (143461221 bytes)
    Index Count % Size % Cum Max Kind
    0 1567864 38 66895512 46 46 52 tuple
    1 285704 7 24972909 17 64 226 str
    2 1142424 28 20757800 14 78 20 bzrlib._static_tuple_c.StaticTuple
    8 2 0 1832 0 99 1684 FIFOCache
    9 35 0 1120 0 99 32 _InternalNode
  9. So about 136MB out of 440MB is reachable from this dict. However, I'm noticing that FIFOCache and _InternalNode is also reachable, and those don't really seem to fit. I also notice that there are 1.6M tuples here, which is often a no-no. (If we are going to have that many tuples, we probably want them to be StaticTuple() because they use a fair amount less memory, can be interned, and aren't in the garbage collector. So lets poke around a little bit
    >>> bigd[0]
    bzrlib._static_tuple_c.StaticTuple(408433296 20B 2refs 9par)
    >>> bigd[1]
    tuple(618390272 44B 4refs 1par)
    >>> pp(bigd[0].c)
    [str(40127328 80B 473par 'svn-v4:138bc75d-0d04-0410-961f-82ee72b054a4:trunk:126948'),
    str(247098672 85B 37par '14@138bc75d-0d04-0410-961f-82ee72b054a4:trunk%2Fgcc%2Finput.h')]
    >>> pp(bigd[1].c)
    [tuple(618383880 36B 2refs 1par),
    bzrlib._static_tuple_c.StaticTuple(569848240 16B 1refs 3par),
    NoneType(505223636 8B 1074389par),
    tuple(618390416 48B 5refs 1par)]
    One thing to note, dict references are [key1, value1, key2, value2] while tuple references are (last, middle, first). I don't know why tuple.tp_traverse traverses in reverse order, but it does. And StaticTuple followed its lead.
    The things to take away from this is
    1. It is mapping a StaticTuple(file_id, revision_id) => tuple()
    2. The target tuple is actually quite complex, so we'll have to dig a bit deeper to figure it out.
    3. The file-id and revision-id are both referenced many times (37 and 473 respectively), so we seem to be doing a decent job sharing those strings.
  10. At this point, I would probably pull up the source code for _BatchingBlockFetcher, to try and figure out what is so big for locations. Looking at the source code, it is actually built in _get_remaining_record_stream as:
    locations = self._index.get_build_details(keys)
    This is then defined as returning:
      :return: A dict of key: (index_memo, compression_parent, parents, record_details).
  11. And the index memo contains a reference to the indexes themselves, but they don't really 'own' them. So lets filter them out:
    >>> indexes = om.get_all('BTreeGraphIndex')
    >>> om.summarize(bigd, excluding=[o.address for o in indexes])
    Total 3740667 objects, 6 types, Total size = 122.9MiB (128855911 bytes)
    Index Count % Size % Cum Max Kind
    0 1567860 41 66895360 51 51 48 tuple
    1 189162 5 19690647 15 67 226 str
    2 948160 25 17261048 13 80 20 bzrlib._static_tuple_c.StaticTuple
    3 1 0 12583052 9 9012583052 dict
    4 1035483 27 12425796 9 99 12 int
    5 1 0 8 0 100 8 NoneType
    (It is currently a bit clumsy that you have to do [o.address], but it means you can use large sets of ints. I'm still trying to sort that out.)
    The memory consumption here looks more realistic. You can also see that just the tuple objects by themselves consume 67MB, or 51% of the memory. You can also see that for a dict holding 500k entries, we have 1.5M tuples. So we are using 3 tuples per key.
  12. Note that we can't just use StaticTuple here, because index_memo[0] is the BTreeGraphIndex. Digging into the code, I think the data is all here:
     result[key] = (self._node_to_position(entry),
    None, parents, (method, None))
    You can see that there is a whole lot of 'None' in this, and we also have an extra tuple at the end which is a bit of a waste (vs just inlining the content). We could save 28 bytes/record (or 28*500k = 14MB) by just inlining that last (method, None). Though it changes some apis.
  13. Another thing to notice is that if you grep through the source code for uses of 'locations', you can see that we use the parents info and the index_memo, but we just ignore everything else. (method, compression_parent, and eol info are never interesting here). So really the result could be:
    result[key] = (self._node_to_position(entry), parents)
    This would be 28 + 4*2 = 36 vs (28+4*4 + 28+4*2) = 80, or saving 44b/record*.5M = 22MB. That is about 20% of that 122MB. Which isn't huge, but isn't a lot of effort to get. We could get a little better if we could collapse the node_to_position info along side the parents info, etc. (Say with a custom object.) That could shave another 28 bytes for the tuple, and maybe one extra reference.
  14. I ended up working on this, because it was like a 10 minute thing. I ended up creating this class (code at lp:
    class _GCBuildDetails(object):
    """A blob of data about the build details.

    This stores the minimal data, which then allows compatibility with the old
    api, without taking as much memory.

    __slots__ = ('_index', '_group_start', '_group_end', '_basis_end',
    '_delta_end', '_parents')

    method = 'group'
    compression_parent = None

    def __init__(self, parents, position_info):
    self._parents = parents
    self._index = position_info[0]
    self._group_start = position_info[1]
    # Is this _end or length? Doesn't really matter to us
    self._group_end = position_info[2]
    self._basis_end = position_info[3]
    self._delta_end = position_info[4]

    def __repr__(self):
    return '%s(%s, %s)' % (self.__class__.__name__,
    self.index_memo, self._parents)

    def index_memo(self):
    return (self._index, self._group_start, self._group_end,
    self._basis_end, self._delta_end)

    def record_details(self):
    return static_tuple.StaticTuple(self.method, None)

    def __getitem__(self, offset):
    """Compatibility thunk to act like a tuple."""
    if offset == 0:
    return self.index_memo
    elif offset == 1:
    return self.compression_parent # Always None
    elif offset == 2:
    return self._parents
    elif offset == 3:
    return self.record_details
    raise IndexError('offset out of range')

    def __len__(self):
    return 4
  15. The size of this class is 48 bytes, including the python object and gc overhead. This replaces the tuple(index_memo_tuple(index, start, end, start, end), None, parents, tuple(method, None)). Which is 28+4*4 + 28+4*5 + 28+4*2 = 128 bytes. So we save 80 bytes per record. on my repository that is ~10.6MB, on this dump it would be 40MB.
  16. The other bit to look at is measuring real-world results. Which looks
    something like this:
    >>> from bzrlib import branch, trace, initialize; initialize().__enter__()

    >>> b ='.')
    >>> b.lock_read()
    >>> keys = b.repository.texts.keys()
    >>> trace.debug_memory('holding all keys')
    WorkingSize 33192KiB PeakWorking 34772KiB holding all keys
    >>> locations = b.repository.texts._index.get_build_details(keys)
    >>> trace.debug_memory('holding all keys')
    WorkingSize 77604KiB PeakWorking 87960KiB holding all keys
Hopefully this has been informative. Digging into a bit of memory consumption, and how to determine where memory is being consumed, and a bit of understanding about how you can rework python objects to save a bit of memory (the biggest thing is to try to use fewer objects overall, since every object is at least 24 bytes, and that is if you are using __slots__. If you aren't then it is a minimum of 172 bytes (32 for the base object + 140 for its __dict__).

Monday, August 2, 2010

Meliae 0.3.0, statistics on subsets

Ah, yet another release. Hopefully with genuinely useful functionality.

In the process of inspecting yet another unexpected memory consumption, I came across a potential solution to the reference cycles problem.

Specifically, the issue is that often (at least in our codebases) you have coupled classes, that end up in a cycle, and you have trouble determining who "owns" what memory. In our case, the objects tend to be only 'loosely' coupled. In that one class passes off reference to a bound method to another object. However, a bound method holds a reference to the original object, so you get a cycle. (For example Repository passes its 'is_locked()' function down to the VersionedFiles so that they know whether it is safe to cache information. Repository "owns" the VersionedFiles, but they end up holding a reference back.)

What turned out to be useful was just adding an exclusion list to most operations. This ends up letting you find out about stuff that is referenced by object1, but is not referenced inside a specific subset.

One of the more interesting apis is the existing ObjManager.summarize().

So you can now do stuff like:
>>> om = loader.load('my.dump')
>>> om.summarize()
>>> om.summarize()
Total 5078730 objects, 290 types, Total size = 367.4MiB (385233882 bytes)
Index Count % Size % Cum Max Kind
0 2375950 46 224148214 58 58 4194313 str
1 63209 1 77855404 20 78 3145868 dict
2 1647097 32 29645488 7 86 20 bzrlib._static_tuple_c.StaticTuple
3 374259 7 14852532 3 89 304 tuple
4 138464 2 12387988 3 93 536 unicode

You can see that there is a lot of strings and dicts referenced here, but who owns them. Tracking into the references and using om.compute_total_size() just seems to get a lot of objects that reference everything. For example:
>>> dirstate = om.get_all('DirState')[0]
>>> om.summarize(dirstate)
Total 5025919 objects, 242 types, Total size = 362.0MiB (379541089 bytes)
Index Count % Size % Cum Max Kind
0 2355265 46 223321197 58 58 4194313 str

Now that did filter out a couple of objects, but when you track the graph, it turns out that DirState refers back to its WorkingTree, and WT has a Branch, which has the Repository, which has all the actual content. So what is actually referred to by just DirState?
>>> from pprint import pprint as pp
>>> pp(dirstate.refs_as_dict())
{'_bisect_page_size': 4096,
'_sha1_file': instancemethod(34050336 40B 3refs 1par),
'_sha1_provider': ContentFilterAwareSHA1Provider(41157008 172B 3refs 2par),
'crc_expected': -1471338016}
>>> pp(om[41157008].c)
[str(30677664 28B 265par 'tree'),
WorkingTree6(41157168 556B 35refs 7par),
type(39222976 452B 4refs 4par 'ContentFilterAwareSHA1Provider')]
>>> wt = om[41157168]
>>> om.summarize(dirstate, excluding=[wt.address])
Total 5025896 objects, 238 types, Total size = 362.0MiB (379539040 bytes)

Oops, I forgot an important step. Instances refer back to their type, and new-style classes keep an MRU reference all the way back to object which ends up referring to the whole dataset.
>>> om.remove_expensive_references()
removed 1906 expensive refs from 5078730 objs

Note that it doesn't take many references (just 2k out of 5M objects) to cause these problems.
>>> om.summarize(dirstate, excluding=[wt.address])
Total 699709 objects, 19 types, Total size = 42.2MiB (44239684 bytes)
Index Count % Size % Cum Max Kind
0 285690 40 20997620 47 47 226 str
1 212977 30 8781420 19 67 48 tuple
2 69640 9 8078240 18 85 116 set

And there you see that we have only 42MB that is directly referenced from DirState. (still more than I would like, but at least it is useful data, rather than just saying it references all objects).

I'm not 100% satisfied with the interface. Right now it takes an iterable of integer addresses. Which is often good because those integers are small and shared, so the only cost is the actual list. Taking objects requires creating the python proxy objects, which is something I'm avoiding because it actually requires a lot of memory to do so. (Analyzing 10M objects takes 1.1GB of peak ram, 780MB sustained.)