Wednesday, August 4, 2010

Step-by-step Meliae

Some people asked me to provide a step-by-step guide to how to debug memory using Meliae. I just ran into another unknown situation, so I figured I'd post a step-by-step along with rationale for why I'm doing it.
  1. First is loading up the data. This was a dump while running 'bzr pack' of a large repository.
    >>> from meliae import loader
    >>> om = loader.load('big.dump')
    >>> om.remove_expensive_references()
    The last step is done because otherwise instances keep a reference to their class, and classes reference their base types, and you end up getting to 'object', and somewhere along the way you end up referencing too much. I don't do it automatically, because it does remove actual references, which someone might want to keep.
  2. Then, do a big summary, just to get started
    >>> om.summarize()
    Total 8364538 objects, 286 types, Total size = 440.4MiB (461765737 bytes)
    Index Count % Size % Cum Max Kind
    0 2193778 26 181553569 39 39 4194281 str
    1 12519 0 97231956 21 6012583052 dict
    2 1599439 19 68293428 14 75 304 tuple
    3 3459765 41 62169616 13 88 20 bzrlib._static_tuple_c.StaticTuple
    4 82 0 29372712 6 94 8388724 set
    5 1052573 12 12630876 2 97 12 int
    6 1644 0 4693700 1 98 2351848 list
    7 4038 0 2245128 0 99 556 _LazyGroupCompressFactory
  3. You can see that
    1. There are 8M objects, and about 440MB of reachable memory.
    2. The vast bulk of that is in strings, but there are also some oddities, like that 12.5MB dictionary
  4. At this point, I wanted to understand what was up with that big dictionary.
    >>> dicts = om.get_all('dict')
    >>> dicts[0]
    dict(417338688 12583052B 1045240refs 2par)
    om.get_all() gives you a list of all objects matching the given type string. It also sorts the returned list, so that the biggest items are at the
    beginning.
  5. Now lets look around a bit, to try to figure out where this dict lives
    >>> bigd = dicts[0]
    >>> from pprint import pprint as pp
    We'll use pprint a lot, so map it to something easy to type.
    >>> pp(bigd.p)
    [frame(39600120 464B 23refs 1par '_get_remaining_record_stream'),
    _BatchingBlockFetcher(180042960 556B 17refs 3par)]
  6. So this dict is contained in a frame, but also an attribute of _BatchingBlockFetcher. Let's try to see which attribute it is.
    >>> pp(bigd.p[1].refs_as_dict())
    {'batch_memos': dict(584888016 140B 4refs 1par),
    'gcvf': GroupCompressVersionedFiles(571002736 556B 13refs 9par),
    'keys': list(186984208 16968B 4038refs 2par),
    'last_read_memo': tuple(536280880 40B 3refs 1par),
    'locations': dict(417338688 12583052B 1045240refs 2par),
    'manager': _LazyGroupContentManager(584077552 172B 7refs 3716par),
    'memos_to_get': list(186983248 52B 1refs 2par),
    'total_bytes': 774119}
  7. It takes a bit to look through that, but you can see:
    'locations': dict(417338688 12583052B 1045240refs 2par)
    Note that 1045240refs means there are 522k key:value pairs in this dict.
  8. How much total memory is this dict referencing?
    >>> om.summarize(bigd)
    Total 4035636 objects, 22 types, Total size = 136.8MiB (143461221 bytes)
    Index Count % Size % Cum Max Kind
    0 1567864 38 66895512 46 46 52 tuple
    1 285704 7 24972909 17 64 226 str
    2 1142424 28 20757800 14 78 20 bzrlib._static_tuple_c.StaticTuple
    ...
    8 2 0 1832 0 99 1684 FIFOCache
    9 35 0 1120 0 99 32 _InternalNode
  9. So about 136MB out of 440MB is reachable from this dict. However, I'm noticing that FIFOCache and _InternalNode is also reachable, and those don't really seem to fit. I also notice that there are 1.6M tuples here, which is often a no-no. (If we are going to have that many tuples, we probably want them to be StaticTuple() because they use a fair amount less memory, can be interned, and aren't in the garbage collector. So lets poke around a little bit
    >>> bigd[0]
    bzrlib._static_tuple_c.StaticTuple(408433296 20B 2refs 9par)
    >>> bigd[1]
    tuple(618390272 44B 4refs 1par)
    >>> pp(bigd[0].c)
    [str(40127328 80B 473par 'svn-v4:138bc75d-0d04-0410-961f-82ee72b054a4:trunk:126948'),
    str(247098672 85B 37par '14@138bc75d-0d04-0410-961f-82ee72b054a4:trunk%2Fgcc%2Finput.h')]
    >>> pp(bigd[1].c)
    [tuple(618383880 36B 2refs 1par),
    bzrlib._static_tuple_c.StaticTuple(569848240 16B 1refs 3par),
    NoneType(505223636 8B 1074389par),
    tuple(618390416 48B 5refs 1par)]
    One thing to note, dict references are [key1, value1, key2, value2] while tuple references are (last, middle, first). I don't know why tuple.tp_traverse traverses in reverse order, but it does. And StaticTuple followed its lead.
    The things to take away from this is
    1. It is mapping a StaticTuple(file_id, revision_id) => tuple()
    2. The target tuple is actually quite complex, so we'll have to dig a bit deeper to figure it out.
    3. The file-id and revision-id are both referenced many times (37 and 473 respectively), so we seem to be doing a decent job sharing those strings.
  10. At this point, I would probably pull up the source code for _BatchingBlockFetcher, to try and figure out what is so big for locations. Looking at the source code, it is actually built in _get_remaining_record_stream as:
    locations = self._index.get_build_details(keys)
    This is then defined as returning:
      :return: A dict of key: (index_memo, compression_parent, parents, record_details).
  11. And the index memo contains a reference to the indexes themselves, but they don't really 'own' them. So lets filter them out:
    >>> indexes = om.get_all('BTreeGraphIndex')
    >>> om.summarize(bigd, excluding=[o.address for o in indexes])
    Total 3740667 objects, 6 types, Total size = 122.9MiB (128855911 bytes)
    Index Count % Size % Cum Max Kind
    0 1567860 41 66895360 51 51 48 tuple
    1 189162 5 19690647 15 67 226 str
    2 948160 25 17261048 13 80 20 bzrlib._static_tuple_c.StaticTuple
    3 1 0 12583052 9 9012583052 dict
    4 1035483 27 12425796 9 99 12 int
    5 1 0 8 0 100 8 NoneType
    (It is currently a bit clumsy that you have to do [o.address], but it means you can use large sets of ints. I'm still trying to sort that out.)
    The memory consumption here looks more realistic. You can also see that just the tuple objects by themselves consume 67MB, or 51% of the memory. You can also see that for a dict holding 500k entries, we have 1.5M tuples. So we are using 3 tuples per key.
  12. Note that we can't just use StaticTuple here, because index_memo[0] is the BTreeGraphIndex. Digging into the code, I think the data is all here:
     result[key] = (self._node_to_position(entry),
    None, parents, (method, None))
    You can see that there is a whole lot of 'None' in this, and we also have an extra tuple at the end which is a bit of a waste (vs just inlining the content). We could save 28 bytes/record (or 28*500k = 14MB) by just inlining that last (method, None). Though it changes some apis.
  13. Another thing to notice is that if you grep through the source code for uses of 'locations', you can see that we use the parents info and the index_memo, but we just ignore everything else. (method, compression_parent, and eol info are never interesting here). So really the result could be:
    result[key] = (self._node_to_position(entry), parents)
    This would be 28 + 4*2 = 36 vs (28+4*4 + 28+4*2) = 80, or saving 44b/record*.5M = 22MB. That is about 20% of that 122MB. Which isn't huge, but isn't a lot of effort to get. We could get a little better if we could collapse the node_to_position info along side the parents info, etc. (Say with a custom object.) That could shave another 28 bytes for the tuple, and maybe one extra reference.
  14. I ended up working on this, because it was like a 10 minute thing. I ended up creating this class (code at lp:
    class _GCBuildDetails(object):
    """A blob of data about the build details.

    This stores the minimal data, which then allows compatibility with the old
    api, without taking as much memory.
    """

    __slots__ = ('_index', '_group_start', '_group_end', '_basis_end',
    '_delta_end', '_parents')

    method = 'group'
    compression_parent = None

    def __init__(self, parents, position_info):
    self._parents = parents
    self._index = position_info[0]
    self._group_start = position_info[1]
    # Is this _end or length? Doesn't really matter to us
    self._group_end = position_info[2]
    self._basis_end = position_info[3]
    self._delta_end = position_info[4]

    def __repr__(self):
    return '%s(%s, %s)' % (self.__class__.__name__,
    self.index_memo, self._parents)

    @property
    def index_memo(self):
    return (self._index, self._group_start, self._group_end,
    self._basis_end, self._delta_end)

    @property
    def record_details(self):
    return static_tuple.StaticTuple(self.method, None)

    def __getitem__(self, offset):
    """Compatibility thunk to act like a tuple."""
    if offset == 0:
    return self.index_memo
    elif offset == 1:
    return self.compression_parent # Always None
    elif offset == 2:
    return self._parents
    elif offset == 3:
    return self.record_details
    else:
    raise IndexError('offset out of range')

    def __len__(self):
    return 4
  15. The size of this class is 48 bytes, including the python object and gc overhead. This replaces the tuple(index_memo_tuple(index, start, end, start, end), None, parents, tuple(method, None)). Which is 28+4*4 + 28+4*5 + 28+4*2 = 128 bytes. So we save 80 bytes per record. on my bzr.dev repository that is ~10.6MB, on this dump it would be 40MB.
  16. The other bit to look at is measuring real-world results. Which looks
    something like this:
    >>> from bzrlib import branch, trace, initialize; initialize().__enter__()

    >>> b = branch.Branch.open('.')
    >>> b.lock_read()
    LogicalLockResult(/2.3-gc-build-details/)>)
    >>> keys = b.repository.texts.keys()
    >>> trace.debug_memory('holding all keys')
    WorkingSize 33192KiB PeakWorking 34772KiB holding all keys
    >>> locations = b.repository.texts._index.get_build_details(keys)
    >>> trace.debug_memory('holding all keys')
    WorkingSize 77604KiB PeakWorking 87960KiB holding all keys
    >>>
Hopefully this has been informative. Digging into a bit of memory consumption, and how to determine where memory is being consumed, and a bit of understanding about how you can rework python objects to save a bit of memory (the biggest thing is to try to use fewer objects overall, since every object is at least 24 bytes, and that is if you are using __slots__. If you aren't then it is a minimum of 172 bytes (32 for the base object + 140 for its __dict__).

3 comments:

Karl said...

Are you sure you're using 0.2? This tutorial all looked very good, but I can't get very far.

>>> pp(bigd)
MemObject(20649360, dict, 786712 bytes, 23166 refs [139924259803184, 139924259803184, 139924258304624, 139924258304624, 79266512, 79266512, 139924259331360, 139924259331360, 139924259577200, 139924259577200, ...])
>>> pp(bigd.p)
Traceback (most recent call last):
File "", line 1, in
AttributeError: 'meliae._loader.MemObject' object has no attribute 'p'

It's a bit hard to get much further with this..

ChemicalStorm said...

You've to get the last version of Meliae to follow this (excellent) how to.

jignesh said...

Thanks for this information this was really helpful.
Passport application form