Monday, August 2, 2010

Meliae 0.3.0, statistics on subsets

Ah, yet another release. Hopefully with genuinely useful functionality.

In the process of inspecting yet another case of unexpected memory consumption, I came across a potential solution to the reference-cycle problem.

Specifically, the issue is that often (at least in our codebases) you have coupled classes that end up in a cycle, and you have trouble determining who "owns" what memory. In our case, the objects tend to be only loosely coupled: one class passes a reference to a bound method off to another object. However, a bound method holds a reference to the original object, so you get a cycle. (For example, Repository passes its 'is_locked()' method down to the VersionedFiles so that they know whether it is safe to cache information. Repository "owns" the VersionedFiles, but they end up holding a reference back to it.)
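
To make the shape of the problem concrete, here is a stripped-down sketch; the class names echo the bzrlib example above, but the code itself is made up:

class VersionedFiles(object):

    def __init__(self, is_locked_callback):
        # Holding the bound method is enough to keep the whole
        # Repository alive.
        self._is_locked = is_locked_callback


class Repository(object):

    def __init__(self):
        # Passing self.is_locked hands over a bound method, and a bound
        # method carries a reference back to self (its __self__ / im_self).
        self._versioned_files = VersionedFiles(self.is_locked)

    def is_locked(self):
        return True


repo = Repository()
# repo -> _versioned_files -> _is_locked (bound method) -> repo: a cycle.
assert repo._versioned_files._is_locked.__self__ is repo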

What turned out to be useful was just adding an exclusion list to most operations. This lets you find out what is referenced by one object but is not referenced from within a specific excluded subset.
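
Conceptually it is just a reachability walk that refuses to step into the excluded addresses. Here is a simplified sketch of one way the idea could work (not meliae's actual implementation, which operates on its own compiled object records):

def reachable_from(refs_by_address, start_address, excluding=()):
    """Return the set of addresses reachable from start_address without
    passing through any address in `excluding`.

    refs_by_address maps an object's address to the addresses it references.
    """
    excluded = set(excluding)
    seen = set()
    pending = [start_address]
    while pending:
        address = pending.pop()
        if address in seen or address in excluded:
            continue
        seen.add(address)
        pending.extend(refs_by_address.get(address, ()))
    return seen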

One of the more interesting APIs is the existing ObjManager.summarize().

So you can now do stuff like:
>>> from meliae import loader
>>> om = loader.load('my.dump')
>>> om.summarize()
Total 5078730 objects, 290 types, Total size = 367.4MiB (385233882 bytes)
 Index   Count   %       Size   % Cum     Max Kind
     0 2375950  46  224148214  58  58 4194313 str
     1   63209   1   77855404  20  78 3145868 dict
     2 1647097  32   29645488   7  86      20 bzrlib._static_tuple_c.StaticTuple
     3  374259   7   14852532   3  89     304 tuple
     4  138464   2   12387988   3  93     536 unicode
...

You can see that there are a lot of strings and dicts referenced here, but who owns them? Tracking into the references and using om.compute_total_size() just seems to pull in a lot of objects that reference everything. For example:
>>> dirstate = om.get_all('DirState')[0]
>>> om.summarize(dirstate)
Total 5025919 objects, 242 types, Total size = 362.0MiB (379541089 bytes)
 Index   Count   %       Size   % Cum     Max Kind
     0 2355265  46  223321197  58  58 4194313 str
...

Now that did filter out a couple of objects, but when you track the graph, it turns out that DirState refers back to its WorkingTree, the WorkingTree has a Branch, which has the Repository, which has all the actual content. So what is actually referred to by just DirState?
>>> from pprint import pprint as pp
>>> pp(dirstate.refs_as_dict())
{'_bisect_page_size': 4096,
...
'_sha1_file': instancemethod(34050336 40B 3refs 1par),
'_sha1_provider': ContentFilterAwareSHA1Provider(41157008 172B 3refs 2par),
...
'crc_expected': -1471338016}
>>> pp(om[41157008].c)
[str(30677664 28B 265par 'tree'),
WorkingTree6(41157168 556B 35refs 7par),
type(39222976 452B 4refs 4par 'ContentFilterAwareSHA1Provider')]
>>> wt = om[41157168]
>>> om.summarize(dirstate, excluding=[wt.address])
Total 5025896 objects, 238 types, Total size = 362.0MiB (379539040 bytes)


Oops, I forgot an important step. Instances refer back to their type, and new-style classes keep an MRO (method resolution order) reference all the way back to object, which ends up referring to the whole dataset.
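
You can see the kind of chain involved in a plain interpreter session (a generic illustration, nothing meliae-specific):
>>> class Foo(object):
...     pass
...
>>> f = Foo()
>>> type(f)
<class '__main__.Foo'>
>>> Foo.__mro__
(<class '__main__.Foo'>, <type 'object'>)

remove_expensive_references() exists to break this sort of chain: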
>>> om.remove_expensive_references()
removed 1906 expensive refs from 5078730 objs

Note that it doesn't take many references (just 2k out of 5M objects) to cause these problems.
>>> om.summarize(dirstate, excluding=[wt.address])
Total 699709 objects, 19 types, Total size = 42.2MiB (44239684 bytes)
 Index   Count   %      Size   % Cum Max Kind
     0  285690  40  20997620  47  47 226 str
     1  212977  30   8781420  19  67  48 tuple
     2   69640   9   8078240  18  85 116 set
...

And there you see that we have only about 42MiB that is directly referenced from DirState. (Still more than I would like, but at least it is useful data, rather than just saying it references all objects.)

I'm not 100% satisfied with the interface. Right now it takes an iterable of integer addresses, which is often good because those integers are small and shared, so the only cost is the actual list. Taking objects instead would require creating the Python proxy objects, which is something I'm avoiding because it actually requires a lot of memory to do so. (Analyzing 10M objects takes 1.1GB of peak RAM, 780MB sustained.)
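
For example, continuing the session above, the exclusion list stays in terms of plain integers, while anything that hands back proxy objects (like om.get_all()) has to build a Python wrapper around every matching record first (the 'WorkingTree6' lookup here is just for illustration):
>>> # Cheap: raw addresses, harvested from the one proxy we already have.
>>> om.summarize(dirstate, excluding=[wt.address])
>>> # More convenient, but every element of working_trees is a freshly
>>> # created proxy object before we even get to its .address.
>>> working_trees = om.get_all('WorkingTree6')
>>> om.summarize(dirstate, excluding=[w.address for w in working_trees])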
