John Arbash Meinel's Bazaar Blog: posts about the development of Bazaar, a distributed version control system, meant to be something developers like to use, rather than something that gets in the way.
<span style="font-size:130%;"><span style="font-weight: bold;">Step-by-step Meliae</span></span> (2010-08-04)<br /><br />Some people asked me to provide a step-by-step guide to debugging memory using Meliae. I just ran into another unknown situation, so I figured I'd post a step-by-step along with the rationale for why I'm doing it.<br /><ol><li>First is loading up the data. This was a dump while running 'bzr pack' of a large repository.<pre>>>> from meliae import loader<br />>>> om = loader.load('big.dump')<br />>>> om.remove_expensive_references()</pre>The last step is done because otherwise instances keep a reference to their class, and classes reference their base types, and you end up getting to 'object', and somewhere along the way you end up referencing too much. I don't do it automatically, because it does remove actual references, which someone might want to keep.<br /></li><li>Then, do a big summary, just to get started:<pre>>>> om.summarize()<br />Total 8364538 objects, 286 types, Total size = 440.4MiB (461765737 bytes)<br />Index Count % Size % Cum Max Kind<br />0 2193778 26 181553569 39 39 4194281 str<br />1 12519 0 97231956 21 60 12583052 dict<br />2 1599439 19 68293428 14 75 304 tuple<br />3 3459765 41 62169616 13 88 20 bzrlib._static_tuple_c.StaticTuple<br />4 82 0 29372712 6 94 8388724 set<br />5 1052573 12 12630876 2 97 12 int<br />6 1644 0 4693700 1 98 2351848 list<br />7 4038 0 2245128 0 99 556 _LazyGroupCompressFactory</pre></li><li>You can see that:<br /><ol><li>There are 8M objects, and about 440MB of reachable memory.<br /></li><li>The vast bulk of that is in strings, but there are also some oddities, like that 12.5MB dictionary.</li></ol></li><li>At this point, I wanted to understand what was up with that big dictionary.<pre>>>> dicts = om.get_all('dict')<br />>>> dicts[0]<br />dict(417338688 12583052B 1045240refs 
2par)</pre>om.get_all() gives you a list of all objects matching the given type string. It also sorts the returned list, so that the biggest items are at the beginning.</li><li>Now let's look around a bit, to try to figure out where this dict lives:<pre>>>> bigd = dicts[0]<br />>>> from pprint import pprint as pp<br />We'll use pprint a lot, so map it to something easy to type.<br />>>> pp(bigd.p)<br />[frame(39600120 464B 23refs 1par '_get_remaining_record_stream'),<br />_BatchingBlockFetcher(180042960 556B 17refs 3par)]</pre></li><li>So this dict is contained in a frame, but it is also an attribute of _BatchingBlockFetcher. Let's try to see which attribute it is.<pre>>>> pp(bigd.p[1].refs_as_dict())<br />{'batch_memos': dict(584888016 140B 4refs 1par),<br />'gcvf': GroupCompressVersionedFiles(571002736 556B 13refs 9par),<br />'keys': list(186984208 16968B 4038refs 2par),<br />'last_read_memo': tuple(536280880 40B 3refs 1par),<br />'locations': dict(417338688 12583052B 1045240refs 2par),<br />'manager': _LazyGroupContentManager(584077552 172B 7refs 3716par),<br />'memos_to_get': list(186983248 52B 1refs 2par),<br />'total_bytes': 774119}</pre></li><li>It takes a bit to look through that, but you can see:<pre>'locations': dict(417338688 12583052B 1045240refs 2par)</pre>Note that 1045240refs means there are 522k key:value pairs in this dict.<br /></li><li>How much total memory is this dict referencing?<pre>>>> om.summarize(bigd)<br />Total 4035636 objects, 22 types, Total size = 136.8MiB (143461221 bytes)<br />Index Count % Size % Cum Max Kind<br />0 1567864 38 66895512 46 46 52 tuple<br />1 285704 7 24972909 17 64 226 str<br />2 1142424 28 20757800 14 78 20 bzrlib._static_tuple_c.StaticTuple<br />...<br />8 2 0 1832 0 99 1684 FIFOCache<br />9 35 0 1120 0 99 32 _InternalNode</pre></li><li><span style="font-family:Georgia,serif;">So about 136MB out of 440MB is reachable from this dict. 
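Conceptually, om.summarize(bigd) is answering a reachability question: walk the reference graph from that one root and total up everything you can touch. Here is a toy sketch of the idea, with made-up node names, and my illustration rather than Meliae's actual implementation:

```python
from collections import deque

# Toy reference graph: each "address" maps to the addresses it references.
graph = {
    'locations': ['key1', 'value1', 'index'],
    'key1': [],
    'value1': ['index'],
    'index': ['FIFOCache'],
    'FIFOCache': [],
}

def reachable_from(root, excluding=()):
    """Breadth-first walk from root, skipping anything in excluding."""
    seen = set(excluding)
    found = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        found.append(node)
        queue.extend(graph[node])
    return found

print(reachable_from('locations'))
# ['locations', 'key1', 'value1', 'index', 'FIFOCache']
print(reachable_from('locations', excluding=['index']))
# ['locations', 'key1', 'value1']
```

Everything the walk reaches gets charged to the root, whether or not the root meaningfully 'owns' it.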
However, I'm noticing that FIFOCache and _InternalNode are also reachable, and those don't really seem to fit. I also notice that there are 1.6M tuples here, which is often a no-no. (If we are going to have that many tuples, we probably want them to be StaticTuple(), because they use a fair amount less memory, can be interned, and aren't in the garbage collector.) So let's poke around a little bit:<pre>>>> bigd[0]<br />bzrlib._static_tuple_c.StaticTuple(408433296 20B 2refs 9par)<br />>>> bigd[1]<br />tuple(618390272 44B 4refs 1par)<br />>>> pp(bigd[0].c)<br />[str(40127328 80B 473par 'svn-v4:138bc75d-0d04-0410-961f-82ee72b054a4:trunk:126948'),<br />str(247098672 85B 37par '14@138bc75d-0d04-0410-961f-82ee72b054a4:trunk%2Fgcc%2Finput.h')]<br />>>> pp(bigd[1].c)<br />[tuple(618383880 36B 2refs 1par),<br />bzrlib._static_tuple_c.StaticTuple(569848240 16B 1refs 3par),<br />NoneType(505223636 8B 1074389par),<br />tuple(618390416 48B 5refs 1par)]</pre>One thing to note: dict references are [key1, value1, key2, value2, ...], while tuple references are (last, middle, first). I don't know why tuple.tp_traverse traverses in reverse order, but it does, and StaticTuple followed its lead.<br />The things to take away from this are:<br /></span><ol><li>It is mapping a StaticTuple(file_id, revision_id) => tuple().</li><li>The target tuple is actually quite complex, so we'll have to dig a bit deeper to figure it out.</li><li>The file-id and revision-id are both referenced many times (37 and 473 times, respectively), so we seem to be doing a decent job of sharing those strings.<br /></li></ol></li><li>At this point, I would probably pull up the source code for _BatchingBlockFetcher, to try and figure out what is so big for locations. 
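As an aside, the reversed traversal order is easy to confirm in plain CPython without Meliae, because gc.get_referents() reports an object's children in tp_traverse order:

```python
import gc

# CPython's tuple.tp_traverse visits items from last to first, so the gc
# module reports a tuple's children in reverse index order.
t = ('first', 'middle', 'last')
print(gc.get_referents(t))  # ['last', 'middle', 'first']
```

Anyway, back to the big dict.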
Looking at the source code, it is actually built in _get_remaining_record_stream as:<pre>locations = self._index.get_build_details(keys)</pre>This is then defined as returning:<pre> :return: A dict of key: (index_memo, compression_parent, parents, record_details).</pre></li><li>And the index memo contains a reference to the indexes themselves, but they don't really 'own' them. So let's filter them out:<pre>>>> indexes = om.get_all('BTreeGraphIndex')<br />>>> om.summarize(bigd, excluding=[o.address for o in indexes])<br />Total 3740667 objects, 6 types, Total size = 122.9MiB (128855911 bytes)<br />Index Count % Size % Cum Max Kind<br /> 0 1567860 41 66895360 51 51 48 tuple<br /> 1 189162 5 19690647 15 67 226 str<br /> 2 948160 25 17261048 13 80 20 bzrlib._static_tuple_c.StaticTuple<br /> 3 1 0 12583052 9 90 12583052 dict<br /> 4 1035483 27 12425796 9 99 12 int<br /> 5 1 0 8 0 100 8 NoneType</pre>(It is currently a bit clumsy that you have to do [o.address], but it means you can use large sets of ints. I'm still trying to sort that out.)<br />The memory consumption here looks more realistic. You can also see that the tuple objects by themselves consume 67MB, or 51% of the memory. You can also see that for a dict holding 500k entries, we have 1.5M tuples, so we are using 3 tuples per key.</li><li>Note that we can't just use StaticTuple here, because index_memo[0] is the BTreeGraphIndex. Digging into the code, I think the data is all here:<pre> result[key] = (self._node_to_position(entry),<br /> None, parents, (method, None))</pre>You can see that there is a whole lot of 'None' in this, and we also have an extra tuple at the end, which is a bit of a waste (vs just inlining the content). We could save 28 bytes/record (or 28*500k = 14MB) by just inlining that last (method, None). 
Though it changes some apis.</li><li>Another thing to notice is that if you grep through the source code for uses of 'locations', you can see that we use the parents info and the index_memo, but we just ignore everything else. (method, compression_parent, and eol info are never interesting here). So really the result could be:<br /> result[key] = (self._node_to_position(entry), parents)<br />This would be 28 + 4*2 = 36 vs (28+4*4 + 28+4*2) = 80, or saving 44b/record*.5M = 22MB. That is about 20% of that 122MB. Which isn't huge, but isn't a lot of effort to get. We could get a little better if we could collapse the node_to_position info along side the parents info, etc. (Say with a custom object.) That could shave another 28 bytes for the tuple, and maybe one extra reference.<br /></li><li>I ended up working on this, because it was like a 10 minute thing. I ended up creating this class (code at lp:<br /><pre>class _GCBuildDetails(object):<br /> """A blob of data about the build details.<br /><br /> This stores the minimal data, which then allows compatibility with the old<br /> api, without taking as much memory.<br /> """<br /><br /> __slots__ = ('_index', '_group_start', '_group_end', '_basis_end',<br /> '_delta_end', '_parents')<br /><br /> method = 'group'<br /> compression_parent = None<br /><br /> def __init__(self, parents, position_info):<br /> self._parents = parents<br /> self._index = position_info[0]<br /> self._group_start = position_info[1]<br /> # Is this _end or length? 
Doesn't really matter to us<br /> self._group_end = position_info[2]<br /> self._basis_end = position_info[3]<br /> self._delta_end = position_info[4]<br /><br /> def __repr__(self):<br /> return '%s(%s, %s)' % (self.__class__.__name__,<br /> self.index_memo, self._parents)<br /><br /> @property<br /> def index_memo(self):<br /> return (self._index, self._group_start, self._group_end,<br /> self._basis_end, self._delta_end)<br /><br /> @property<br /> def record_details(self):<br /> return static_tuple.StaticTuple(self.method, None)<br /><br /> def __getitem__(self, offset):<br /> """Compatibility thunk to act like a tuple."""<br /> if offset == 0:<br /> return self.index_memo<br /> elif offset == 1:<br /> return self.compression_parent # Always None<br /> elif offset == 2:<br /> return self._parents<br /> elif offset == 3:<br /> return self.record_details<br /> else:<br /> raise IndexError('offset out of range')<br /> <br /> def __len__(self):<br /> return 4<br /></pre></li><li>The size of this class is 48 bytes, including the python object and gc overhead. This replaces the tuple(index_memo_tuple(index, start, end, start, end), None, parents, tuple(method, None)), which is 28+4*4 + 28+4*5 + 28+4*2 = 128 bytes. So we save 80 bytes per record. On my bzr.dev repository that is ~10.6MB; on this dump it would be 40MB.<br /></li><li>The other bit to look at is measuring real-world results. 
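The byte arithmetic above is easy to sanity-check with sys.getsizeof. This is a rough sketch with stand-in classes, not the real _GCBuildDetails; on a 64-bit Python the absolute numbers differ from the 32-bit figures in the text, but the ratio is the point:

```python
import sys

class WithSlots(object):
    __slots__ = ('index', 'group_start', 'group_end', 'basis_end', 'delta_end')
    def __init__(self):
        self.index = self.group_start = self.group_end = None
        self.basis_end = self.delta_end = None

class Plain(object):
    def __init__(self):
        self.index = self.group_start = self.group_end = None
        self.basis_end = self.delta_end = None

s, p = WithSlots(), Plain()
slots_size = sys.getsizeof(s)
# A plain instance also drags a whole __dict__ around with it.
plain_size = sys.getsizeof(p) + sys.getsizeof(p.__dict__)
print(slots_size, plain_size)
```

Still, that arithmetic is only an estimate; the convincing number comes from watching a live process.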
Which looks<br />something like this:<pre>>>> from bzrlib import branch, trace, initialize; initialize().__enter__()<br /><bzrlib.library_state.BzrLibraryState><br />>>> b = branch.Branch.open('.')<br />>>> b.lock_read()<br />LogicalLockResult(<bound method BzrBranch7.unlock of BzrBranch7(file:///C:/Users/jameinel/dev/bzr/lp<br />/2.3-gc-build-details/)>)<br />>>> keys = b.repository.texts.keys()<br />>>> trace.debug_memory('holding all keys')<br />WorkingSize 33192KiB PeakWorking 34772KiB holding all keys<br />>>> locations = b.repository.texts._index.get_build_details(keys)<br />>>> trace.debug_memory('holding all keys')<br />WorkingSize 77604KiB PeakWorking 87960KiB holding all keys<br />>>></pre></li></ol>Hopefully this has been informative: digging into a bit of memory consumption, determining where memory is being consumed, and understanding how you can rework python objects to save memory. The biggest thing is to try to use fewer objects overall, since every object is at least 24 bytes, and that is if you are using __slots__. If you aren't, then it is a minimum of 172 bytes (32 for the base object + 140 for its __dict__).<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Meliae 0.3.0, statistics on subsets</span></span> (2010-08-02)<br /><br />Ah, yet another release. Hopefully with genuinely useful functionality.<br /><br />In the process of inspecting yet another unexpected memory consumption, I came across a potential solution to the reference cycles problem.<br /><br />Specifically, the issue is that often (at least in our codebases) you have coupled classes that end up in a cycle, and you have trouble determining who "owns" what memory. In our case, the objects tend to be only 'loosely' coupled, in that one class passes off a reference to a bound method to another object. 
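A tiny illustration of that shape, with hypothetical stand-ins for the real bzrlib classes:

```python
class VersionedFiles(object):
    def __init__(self, is_locked):
        self._is_locked = is_locked  # stores the bound method it was given

class Repository(object):
    """Owns a VersionedFiles, and hands it a bound method to call back."""
    def __init__(self):
        self._lock_count = 0
        self.texts = VersionedFiles(self.is_locked)

    def is_locked(self):
        return self._lock_count > 0

repo = Repository()
```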
However, a bound method holds a reference to the original object, so you get a cycle. (For example, Repository passes its 'is_locked()' function down to the VersionedFiles so that they know whether it is safe to cache information. Repository "owns" the VersionedFiles, but they end up holding a reference back.)<br /><br />What turned out to be useful was just adding an exclusion list to most operations. This ends up letting you find out about stuff that is referenced by object1, but is not referenced inside a specific subset.<br /><br />One of the more interesting apis is the existing ObjManager.summarize().<br /><br />So you can now do stuff like:<br /><pre>>>> om = loader.load('my.dump')<br />>>> om.summarize()<br />Total 5078730 objects, 290 types, Total size = 367.4MiB (385233882 bytes)<br />Index Count % Size % Cum Max Kind<br /> 0 2375950 46 224148214 58 58 4194313 str<br /> 1 63209 1 77855404 20 78 3145868 dict<br /> 2 1647097 32 29645488 7 86 20 bzrlib._static_tuple_c.StaticTuple<br /> 3 374259 7 14852532 3 89 304 tuple<br /> 4 138464 2 12387988 3 93 536 unicode<br /> ...</pre><br />You can see that there are a lot of strings and dicts referenced here, but who owns them? Tracking into the references and using <tt>om.compute_total_size()</tt> just seems to get a lot of objects that reference everything. For example:<pre>>>> dirstate = om.get_all('DirState')[0]<br />>>> om.summarize(dirstate)<br />Total 5025919 objects, 242 types, Total size = 362.0MiB (379541089 bytes)<br />Index Count % Size % Cum Max Kind<br /> 0 2355265 46 223321197 58 58 4194313 str<br />...</pre><br />Now that did filter out a couple of objects, but when you track the graph, it turns out that DirState refers back to its WorkingTree, and the WT has a Branch, which has the Repository, which has all the actual content. So what is actually referred to by just DirState? 
<pre>>>> from pprint import pprint as pp<br />>>> pp(dirstate.refs_as_dict())<br />{'_bisect_page_size': 4096,<br />...<br />'_sha1_file': instancemethod(34050336 40B 3refs 1par),<br />'_sha1_provider': ContentFilterAwareSHA1Provider(41157008 172B 3refs 2par),<br />...<br />'crc_expected': -1471338016}<br />>>> pp(om[41157008].c)<br />[str(30677664 28B 265par 'tree'),<br />WorkingTree6(41157168 556B 35refs 7par),<br />type(39222976 452B 4refs 4par 'ContentFilterAwareSHA1Provider')]<br />>>> wt = om[41157168]<br />>>> om.summarize(dirstate, excluding=[wt.address])<br />Total 5025896 objects, 238 types, Total size = 362.0MiB (379539040 bytes)</pre><br /><br />Oops, I forgot an important step. Instances refer back to their type, and new-style classes keep an MRO reference all the way back to <tt>object</tt>, which ends up referring to the whole dataset. <pre>>>> om.remove_expensive_references()<br />removed 1906 expensive refs from 5078730 objs</pre><br />Note that it doesn't take many references (just 2k out of 5M objects) to cause these problems.<br /><pre>>>> om.summarize(dirstate, excluding=[wt.address])<br />Total 699709 objects, 19 types, Total size = 42.2MiB (44239684 bytes)<br />Index Count % Size % Cum Max Kind<br /> 0 285690 40 20997620 47 47 226 str<br /> 1 212977 30 8781420 19 67 48 tuple<br /> 2 69640 9 8078240 18 85 116 set<br />...</pre><br />And there you see that we have only 42MB that is directly referenced from DirState. (Still more than I would like, but at least it is <b>useful</b> data, rather than just saying it references all objects.)<br /><br />I'm not 100% satisfied with the interface. Right now it takes an iterable of integer addresses, which is often good because those integers are small and shared, so the only cost is the actual list. Taking objects would require creating the python proxy objects, which is something I'm avoiding because it actually requires a lot of memory to do so. 
(Analyzing 10M objects takes 1.1GB of peak ram, 780MB sustained.)<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Meliae 0.2.1</span></span> (2010-07-20)<br /><br />Meliae 0.2.1 is now officially released.<br /><br />The list of changes isn't long; it is mostly a bugfix release, with a couple of quality-of-life changes.<br /><br />For example, you used to need to do:<br /><pre>>>> om = loader.load(filename)<br />>>> om.compute_parents()<br />>>> om.collapse_instance_dicts()</pre><br />However, that is now done as part of <span style="font-family: courier new;">loader.load()</span>. This also goes along with a small bug fix to <span style="font-family: courier new;">scanner.dump_all_objects()</span> that makes sure to avoid dumping the <span style="font-family: courier new;">gc.get_objects()</span> list, since that is an artifact of scanning, and not actually something you care about in the dump file.<br /><br />Many thanks to Canonical for bringing me to Prague for the Launchpad Epic, giving me some time to work on stuff that isn't just Bazaar.<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Meliae 0.2.0</span></span> (2010-01-09)<br /><br />And here we are, with a new release of Meliae <a href="https://edge.launchpad.net/meliae/0.2/0.2.0">0.2.0</a>. This is a fairly major reworking of the internals, though it should be mostly compatible with 0.1.2. (The disk format did not change, and most of the apis have deprecated thunks to help you migrate.)<br /><br />The main difference is how data is stored in memory. Instead of using a Python dict + python objects, I now use a custom data collection. Python's generic objects are great for getting stuff going, but I was able to cut memory consumption in half with a custom object. 
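The kind of saving a custom collection buys can be sketched with the stdlib alone. MemObjectPy here is a made-up stand-in for a per-object record, and the exact byte counts depend on your Python build:

```python
import sys
from array import array

class MemObjectPy(object):
    """One Python object per scanned record -- the generic approach."""
    __slots__ = ('address', 'size')
    def __init__(self, address, size):
        self.address = address
        self.size = size

n = 100000
# A list of small Python objects, one per record...
as_objects = [MemObjectPy(i, 24) for i in range(n)]
obj_bytes = sys.getsizeof(as_objects) + sum(sys.getsizeof(o) for o in as_objects)
# ...versus two packed C arrays holding the same two fields.
as_columns = (array('l', range(n)), array('l', [24] * n))
col_bytes = sum(sys.getsizeof(a) for a in as_columns)
print(obj_bytes, col_bytes)
```

Packing the fields into C-level arrays instead of one Python object per record is roughly where the factor of two comes from.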
This means that finally, analyzing a 600MB dump takes less than 600MB of memory (currently about 300MB). Of course, that also depends on your data structures (a 600MB dump that is one 500MB string will take up very little memory for analysis).<br /><br />The second biggest feature is hopefully a cleaner interface.<br /><ol><li>Call references 'parents' or 'children', indicating objects which point to me and objects which I point to, respectively. 'ref_list' and 'referrers' were confusing: both start with 'ref', so it takes a bit to sort them out.</li><li>Add attributes to get direct access to parents and children, rather than having to go back through the ObjManager.</li><li>Change the formatting strings to be more compact. No longer show the refs by default, since you can get to the objects anyway.</li></ol>A third minor improvement is support for collapsing old-style classes (ones that don't inherit from 'object').<br /><br />So how about an example? To start with, you need a way to interrupt your running process and get a dump of memory. I can't really give you much help, but you'll end up wanting:<br /><pre>from meliae import scanner<br />scanner.dump_all_objects('test-file.dump')</pre><br />(This is the simplest method. There are others that take less memory while dumping, if overhead is a concern.)<br /><br />Once you have that dump file, start up another python process and let's analyze it.<br /><pre>$ python<br />>>> from meliae import loader<br />>>> om = loader.load('test-file.dump')<br />loaded line 3579013, 3579014 objs, 377.4 / 377.4 MiB read in 79.6s</pre><br />I recommend just always running these lines. 
If you used a different method of dumping, there are other things to do, which is why it isn't automatic (yet).<br /><pre>>>> om.compute_parents(); om.collapse_instance_dicts()<br />set parents 3579013 / 3579014<br />checked 3579013 / 3579014 collapsed 383480<br />set parents 3195533 / 3195534</pre><br />Now we can look at the data, and get a feel for where our memory has gone:<br /><pre>>>> s = om.summarize(); s<br />Total 3195534 objects, 418 types, Total size = 496.8MiB (520926557 bytes)<br />Index Count % Size % Cum Max Kind<br /> 0 189886 5 211153232 40 40 1112 Thread<br /> 1 199117 6 72510520 13 54 12583192 dict<br /> 2 189892 5 65322848 12 66 344 _Condition<br /> 3 380809 11 30464720 5 72 80 instancemethod<br /> 4 397892 12 28673968 5 78 2080 tuple<br /> 5 380694 11 27409968 5 83 72 builtin_function_or_method<br /> 6 446606 13 26100905 5 88 14799 str<br /> 7 189886 5 21267232 4 92 112 _socketobject<br /> 8 197255 6 14568080 2 95 14688 list<br />...</pre><br />At this point, you can see that there are 190k instances of Thread, which are consuming 40% of all memory. There is also a very large 12.5MB dict. (It turns out that this dict holds all of those Thread objects.)<br /><br />But how do we determine that? 
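To make the failure mode concrete first, here is a contrived sketch (not Twisted's actual ThreadPool) of the usual culprit: a container that appends every worker and never prunes it:

```python
import threading

class LeakyPool(object):
    """Contrived sketch: a pool that remembers every worker forever."""
    def __init__(self):
        self.threads = []          # nothing ever removes entries

    def spawn(self):
        t = threading.Thread(target=lambda: None)
        self.threads.append(t)     # keeps the Thread (and its locks) alive
        return t

pool = LeakyPool()
for _ in range(5):
    pool.spawn()
print(len(pool.threads))  # 5, and it only ever grows
```

The dump can tell us whether some list like that is what keeps the threads alive.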
One thing we can do is just get a handle to all of those Thread instances:<br /><pre>>>> threads = om.get_all('Thread')<br />>>> threads[0]<br />Thread(32874448 1112B 23refs 3par)</pre><br />So this thread is at address 32874448 (not particularly relevant), consumes 1112 bytes of memory (including its dict, since we collapsed threads), references 23 python objects, and is referenced by 3 python objects.<br /><br />Let's see those references:<br /><pre>>>> threads[0].c # shortcut for 'children'<br />[str(11409312 54B 189887par '_Thread__block'), _Condition(32903248 344B 11refs<br /> 1par), str(11408976 53B 189887par '_Thread__name'), str(32862080 77B 1par <br />'PoolThread-twisted.internet.reactor-1'), str(1...</pre><br />It looks like there might be something interesting there, but it is a bit hard to sort out. Step one is to try using python's pprint utility.<br /><pre>>>> from pprint import pprint as pp<br />>>> pp(threads[0].c)<br />[str(11409312 54B 189887par '_Thread__block'),<br /> _Condition(32903248 344B 11refs 1par),<br /> str(11408976 53B 189887par '_Thread__name'),<br /> str(32862080 77B 1par 'PoolThread-twisted.internet.reactor-1'),<br /> str(11429168 57B 189887par '_Thread__daemonic'),<br /> bool(7478912 24B 572370par 'False'),<br /> str(11409200 56B 189887par '_Thread__started'),<br /> bool(7478944 24B 571496par 'True'),<br />...</pre><br />That's a bit better, but I also know that instances have a dict, so let's try:<br /><pre>>>> pp(threads[0].refs_as_dict)<br />{'_Thread__args': tuple(140013759823952 56B 2008par),<br /> '_Thread__block': _Condition(32903248 344B 11refs 1par),<br /> '_Thread__daemonic': False,<br /> '_Thread__initialized': True,<br /> '_Thread__kwargs': dict(32516192 280B 1par),<br /> '_Thread__name': 'PoolThread-twisted.internet.reactor-1',<br /> '_Thread__started': True,<br />...</pre><br />(Note to self: find a good way to shorten 'refs_as_dict', too much typing.) Now that is starting to look like you can actually understand what is 
going on.<br /><br />Another question to ask: who is referencing this object (why is it still active)?<br /><pre>>>> pp(threads[0].p)<br />[list(33599432 104B 1refs 1par),<br /> list(33649944 104B 1refs 1par),<br /> dict(11279168 1048B 10refs 1par)]</pre><br />So this thread is in 2 lists and a dict with 10 items. So what about the parents of the parents?<br /><pre>>>> pp(threads[0].p[0].p)<br />[ThreadPool(32888520 1120B 21refs 2par)]</pre><br />So the first list is held by a ThreadPool. We can quickly check info about that object:<br /><pre>>>> pp(threads[0].p[0].p[0].refs_as_dict())<br />{'joined': False,<br /> 'max': 10,<br /> 'min': 0,<br /> 'name': 'twisted.internet.reactor',<br /> 'q': Queue(32888592 1120B 15refs 1par),<br /> 'started': True,<br /> 'threads': list(33599432 104B 1refs 1par),<br /> 'waiters': list(33649944 104B 1refs 1par),<br /> 'workers': 1,<br /> 'working': list(33649656 72B 1par)}</pre><br />So that seems to be a Twisted thread pool.<br />What about the other parents?<br /><pre>>>> pp(threads[0].p[1].p)<br />[ThreadPool(32888520 1120B 21refs 2par)]</pre><br />Also a list held by a ThreadPool.<br /><pre>>>> pp(threads[0].p[2].p)<br />[dict(11253824 3352B 98refs 70par)]</pre><br />Hmmm, now we have a dict pointing to 98 objects which is, itself, referenced by 70 objects. This at least seems worth investigating further.<br /><pre>>>> d = threads[0].p[2].p[0]<br />>>> d<br />dict(11253824 3352B 98refs 70par)</pre><br />Yep, that's the one. We can try to dump it as a dict:<br /><pre>>>> pp(d.refs_as_dict())<br />{'BoundedSemaphore': 'BoundedSemaphore',<br /> 'Condition': 'Condition',<br /> 'Event': 'Event',<br /> 'Lock': builtin_function_or_method(10872592 72B 1refs 7par),<br /> 'RLock': 'RLock',<br /> 'Semaphore': 'Semaphore',<br /> 'Thread': 'Thread',<br />...</pre><br />Now, the formatting here is actually hiding something. 
Namely that the referred-to object is actually a type:<br /><pre>>>> d.c[1]<br />type(11280288 880B 4refs 2par '_BoundedSemaphore')</pre><br />From experience, I know that this is probably a module's dict. It has a lot of objects, and a lot of objects referencing it (all functions reference their module's global dict). I'm still working on how to collapse a module's __dict__ into the module itself for clarity. Anyway, let's look at the parents to see what module this is.<br /><pre>>>> pp([p for p in d.p if p.type_str == 'module'])<br />[module(11411416 56B 1refs 18par 'threading')]</pre><br />And there you go, the threading module.<br /><br />And that's how you walk around the object graph. To finish analyzing this memory, I would probably poke at all the thread objects, and see what they are trying to accomplish. But mostly, the summary tells you that something is wrong. You shouldn't really be able to have 200k active threads doing real work. So probably you have something that is accidentally preserving threads that are no longer active.<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Memory Debugging with Meliae</span></span> (2009-11-16)<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Background of Meliae 0.1.0</span></span><br /><br />Earlier this year I started working on a new memory debugging program for python. I had originally tried to use <a href="http://guppy-pe.sourceforge.net/#Heapy">heapy</a>, but at the time it didn't support Windows, Mac, or 64-bit environments. (Which turned out to be all of my interesting platforms.) The other major problem is that I'm often debugging memory consumption of up to a GB of active data. 
While I think some of the former issues have been fixed, the latter is still a major issue for me.<br /><br />So with the help of <a href="https://launchpad.net/%7Emwhudson">Michael Hudson</a>, I started putting together a new structure. The code would be split into a scanner and a processor (loader), such that you can interrupt a running process, dump the memory consumption to disk, and then analyze it in a separate process (often after the former has stopped). The scanner can have a minimal memory profile, so even if your system is already swapping, you can dump out the memory info. (<a href="http://rbtcollins.wordpress.com/">Robert Collins</a> successfully dumped a 6GB memory profile, though analyzing that beast is still an issue.) The other advantage of this system is that I don't have to play tricks with objects that represent the current state, like Guppy does with all sorts of crazy decorators.<br /><br />In recent months, I've also focused on improving Bazaar's memory profile, which also meant improving memory profiling. Enough that I felt it was worth releasing the code. So officially <span style="font-weight: bold;">Meliae</span> 0.1.0 has been released. (For those wondering about the name, it is from the <a href="http://en.wikipedia.org/wiki/Meliae">Ash-Wood Nymph in Greek Mythology</a>; aka, it is just a fun name.)<br /><br /><span style="font-weight: bold;"><span style="font-size:130%;">Doing real work</span></span><br />So how does one actually use the program? <a href="http://bazaar-vcs.org/en">Bazaar</a> has a very nice <a href="http://bazaar.launchpad.net/%7Ebzr-pqm/bzr/bzr.dev/annotate/head%3A/bzrlib/breakin.py">ability</a>: you can use SIGQUIT (Ctrl+|) or SIGBREAK (Ctrl+Pause/Break) to drop into a debugger in the middle of a process to figure out what is going on. 
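Bazaar wires that hook up for you; if your own program has nothing similar, a minimal sketch using the stdlib signal module (POSIX-only, and the handler body is hypothetical) could look like:

```python
import signal

def debug_hook(signum, frame):
    # A real hook would start pdb here, or call something like
    # meliae's scanner.dump_all_objects('filename.json') on the live heap.
    debug_hook.fired = True

debug_hook.fired = False
signal.signal(signal.SIGQUIT, debug_hook)  # SIGQUIT is Ctrl+\ on POSIX
signal.raise_signal(signal.SIGQUIT)        # stand-in for the keypress
print(debug_hook.fired)
```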
At that point, you can just:<br /><pre>from meliae import scanner<br />scanner.dump_all_objects('filename.json')</pre><span style="font-size:85%;">(There is an alternative scanner.dump_gc_objects() which has an even lower memory profile, but will dump some objects more than once, creating a larger dump file.)</span><br /><br />This creates a file describing all of the Python objects it was able to find, along with their known size, references, and for some objects (strings, ints) their content. From there, you start another shell, and use:<br /><pre>>>> from meliae import loader<br />>>> om = loader.load('filename.json')<br />>>> s = om.summarize(); s</pre>This dumps out something like:<br /><pre>Total 17916 objects, 96 types, Total size = 1.5MiB (1539583 bytes)<br />Index Count % Size % Cum Max Kind<br /> 0 701 3 546460 35 35 49292 dict<br /> 1 7138 39 414639 26 62 4858 str<br /> 2 208 1 94016 6 68 452 type<br /> 3 1371 7 93228 6 74 68 code<br /> 4 1431 7 85860 5 80 60 function<br /> 5 1448 8 59808 3 84 280 tuple<br /> 6 552 3 40760 2 86 684 list<br /> 7 56 0 29152 1 88 596 StgDict<br /> 8 2167 12 26004 1 90 12 int<br /> 9 619 3 24760 1 91 40 wrapper_descriptor<br /> 10 570 3 20520 1 93 36 builtin_function_or_method<br /> ...</pre><br />This shows the top objects and what data they consume, which can often be revealing in itself. Do you have millions of tuples? One giant dict that is consuming a surprising amount of memory? (A dict with 200k entries is ~6MB on a 32-bit platform.)<br /><br />There is more that can be done. You can run:<br /><pre>om.compute_referrers()</pre><br />At this point, you can look at a single node, and find out what was referencing it. 
(So what was referencing that largest dict?)<br /><pre>>>> om[s.summaries[0].max_address]<br />MemObject(29351984, dict, 49292 bytes, 1578 refs [...], 1 referrers [26683840])<br /><br />>>> om[26683840]<br />MemObject(29337264, function, format_string, 60 bytes, 6 refs...)</pre><br />However, it also turns out that all 'classic' classes in Python indirect to their data via self.__dict__, which is a bit annoying to walk through. It also makes it look like 'dict' is the #1 memory consumer, when actually it might be instances of Foo, which happen to use dicts. So you can use:<br /><pre>om.collapse_instance_dicts()</pre>This will find all instances that seem to have trivial references to a __dict__, and then collapse it so that all references are directly from the instance, and all referenced objects then claim the instance as the referrer.<br /><br />The above dump changes to:<br /><pre>>>> s = om.summarize(); s<br />Total 17701 objects, 96 types, Total size = 1.5MiB (1539583 bytes)<br />Index Count % Size % Cum Max Kind<br /> 0 7138 40 414639 26 26 4858 str<br /> 1 486 2 394632 25 52 49292 dict<br /> 2 208 1 94016 6 58 452 type<br /> 3 1371 7 93228 6 64 68 code<br /> 4 1431 8 85860 5 70 60 function<br /> 5 149 0 82844 5 75 556 ReadLineTextBuffer<br /> 6 93 0 65384 4 79 6312 module<br /> 7 1448 8 59808 3 83 280 tuple<br /> 8 552 3 40760 2 86 684 list<br /> 9 56 0 29152 1 88 596 StgDict<br /> 10 2167 12 26004 1 90 12 int</pre><br />This shows that ReadLineTextBuffer is actually a large consumer of memory.<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Misc</span></span><br />There are other bits to explore, and improvements to be made. "scanner.get_recursive_size()" can be useful if you don't want to dump out a big file to analyze memory referenced from a given object (such as a cache). 
It doesn't give the whole picture, but can be useful in an interactive session.<br /><br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Results</span></span><br />In the end, this code has enabled us to cut the memory consumption of Bazaar<br />roughly in half (for <tt>bzr branch</tt>). It also lets you see nice summaries<br />like this:<br /><br /><pre>Total 2805995 objects, 276 types, Total size = 946.0MiB (991983819 bytes)<br />Index Count % Size % Cum Max Kind<br /> 0 1939090 69 916011611 92 92 5762600 str<br /> 1 9449 0 33069868 3 95 3145868 dict<br /> 2 132202 4 12506732 1 96 536 unicode<br /> 3 383436 13 7048652 0 97 20 bzrlib._static_tuple_c.StaticTuple<br /> 4 160027 5 5873744 0 98 304 tuple<br /> 5 5429 0 5185252 0 98 412236 list<br /> 6 62256 2 4482432 0 99 72 InventoryFile<br /> 7 148 0 1334032 0 99 1048692 set<br /> 8 2185 0 1214860 0 99 556 GroupCompressBlock<br /> 9 8003 0 992372 0 99 124 CHKInventoryDirectory<br />...</pre><br /><br />(Note that after seeing this, we changed the code to not cache as many strings in memory, and I managed to decrease memory consumption to about 1/3rd of what it once was for this operation.)<br /><br />The code isn't perfect, but being able to get a view of where memory is going, and what objects are holding on to it, is a huge improvement over just being in the dark.jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com2tag:blogger.com,1999:blog-4423175964608972068.post-76425623443677356502009-10-15T09:30:00.009-06:002009-10-15T10:24:17.729-06:00The Joys of multiple releasesI had originally written a longer post over at wordpress, only to have Firefox crash while trying to move an image, and WP doesn't do auto-saving like blogger. So now I'm back...<br /><br />Bazaar 2.0.1 and 2.1.0b1 have now 'gone gold' in that I've uploaded the official tarballs, and asked people to make installers for them.
Once installers are made, then we'll make the official announcement.<br /><br />For those who haven't been following, Bazaar has now split its releases into 2 series. The 2.0.x series is based on 2.0.0 and has only bugfixes. Things that could cause compatibility problems (new features, removal of deprecated code, etc.) are only done in the 2.1.0.x series. We're hoping that this can give people some flexibility, as well as giving <span style="font-weight: bold;">us</span> more flexibility. In the past, we've suffered a bit trying to maintain backwards compatibility for some features/bugfixes, only to break compatibility for a big feature. Instead of suffering the worst of both, we're trying to get the best of both. If something needs to break compatibility, it just goes in the dev branch. Note that the development branch is still considered 'stable', in that the test suite always passes, and the code is pretty much always ready for a release. We just don't make the same guarantees about stable internal APIs for 3rd parties to use.<br /><br />The other change to the process is to stop doing as many "release candidate" builds. Instead, we will just cut a release. If there are problems, we'll cut the next release sooner. The chance for regressions in the 'bugfix-only' 2.0.x series should be low, and getting away from pre-builds means less overhead. We will still be doing releases we call 'rc1' before the next major stable release (2.1.0), and in that vein we expect to do little-to-no changes from the rc1 to the final build.<br /><br />However, this new system does increase overhead for a single release, as it is now equivalent to doing the rc and the final in the same day.
Also, because we now have 2 "integration" branches, it requires a bit more coordination between them.<br /><br />For example, this is the revision graph for the recent 2.0.1 and 2.1.0b1 release<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzhX8d0iGPJY6y0eOp36Syg6W6NYOKRebjbGa8yz1NSTU8Q5cFMty8j_BdbAk3nzZ54SPFtYNGImyBD1205BWi4njgXFBgEWA0qZvDxh-u5jXCF5kuSzEkdYISj345AVaMi2Q78t-csWVR/s1600-h/2.0.1and2.1.0b1.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 204px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhzhX8d0iGPJY6y0eOp36Syg6W6NYOKRebjbGa8yz1NSTU8Q5cFMty8j_BdbAk3nzZ54SPFtYNGImyBD1205BWi4njgXFBgEWA0qZvDxh-u5jXCF5kuSzEkdYISj345AVaMi2Q78t-csWVR/s400/2.0.1and2.1.0b1.png" alt="" id="BLOGGER_PHOTO_ID_5392855159103734322" border="0" /></a><br />The basic workflow that I used was something like<br /><ol><li>Have a <a href="https://edge.launchpad.net/%7Ecanonical-losas">LOSA </a>create 2 release branches <a href="https://code.edge.launchpad.net/%7Ebzr-pqm/bzr/2.0.1">lp:~bzr-pqm/bzr/2.0.1</a> and <a href="https://code.edge.launchpad.net/%7Ebzr-pqm/bzr/2.1.0b1">lp:~bzr-pqm/bzr/2.1.0b1</a></li><li>Create a local branch of each</li><li>Create another branch for doing my updates in, such as <a href="https://code.edge.launchpad.net/%7Ejameinel/bzr/2.0.1">lp:~jameinel/bzr/2.0.1</a></li><li>Update 2.0.1 with a new version string</li><li>Update NEWS to clean it up, show that there is an official release, and provide a summary/overview of the changes.</li><li>Land this update into the official 2.0.1 branch via PQM. (Unfortunately this can take up to 2 hours depending on a bunch of different factors. We are trying to get this down to more like 10 min.)</li><li>Update my local copy from the final release. 
Tag it (bzr-2.0.1).<br /></li><li>Create the tarball</li><li>Create the release on Launchpad</li><li>Upload the tarball to the release</li><li>While this is going on, go through the bugtracker and make sure that things mentioned in NEWS have the appropriate "Fix Released" state in the bug tracker, as well as being associated with the right milestones. With 34 bugfixes, this is a non-trivial undertaking.<br /></li><li>Merge the 2.0.1 final release into the 2.1.0b1 branch. (All bugfixes in the stable series are candidates for merging at any time into the development series.)</li><li>Do lots of cleanup in NEWS. The main difficulty here is that bugfixes are present on 2 integration branches simultaneously, and those releases are slightly independent. We've talked about having the bugfix mentioned in both sections, which would be more important if we ever make a development release <span style="font-style: italic;">without</span> doing the corresponding stable release.</li><li>Do steps 4-10 again for 2.1.0b1.</li><li>While working or waiting on that, prepare <a href="https://code.edge.launchpad.net/%7Ebzr-pqm/bzr/2.0">lp:~bzr-pqm/bzr/2.0</a> since it is now going to be prepped for 2.0.2. This involves bumping the version number, updating NEWS with blank entries for the next release (avoids some conflicts for people landing changes in that branch), and submitting all of that back to PQM.</li><li>When that has finished, bring the 2.0 stable branch back into bzr.dev, and prepare bzr.dev for 2.1.0b2 (version number bumps, NEWS cleanups, etc.)</li><li>In this case, cleaning up NEWS was again a bit of a chore, as now you have a file that needs a blank area for both the 2.1.0b2 changes and the 2.0.2 changes. Further, some of the changes that landed in bzr.dev in the meantime were not included in the 2.1.0b1 release. So you have to move them up into the new section.
Getting NEWS right across 4 branches was quite a bit of work, and probably the hardest part (so far) of the process. Copy & Paste + bzr diff + bzr vimdiff were quite helpful here. Setting the NEWS in bzr.dev to the exact copy from 'bzr-2.1.0b1' and then showing what was removed/added was a nice way to make sure to get everything.</li><li>breathe</li><li>Announce the tarballs, etc. on the bzr mailing list, so that people can start preparing packages/installers.</li><li>I'm also the windows installer packager, so I get to build around 8 installers... (standalone installers, and 3 python [2.4, 2.5, 2.6] installers.) Most of this is scripted, but often something breaks along the way.</li><li>Update Pypi, freshmeat, ... with updates for the new versions. Twice. (At least here we did not update for 'rc' versions. So this work is strictly doubled.)</li></ol>Overall, it is a fair amount of work. I think it will amount to 1-2 days of work time (spread out over 3+ days of real time). With any luck, it will amount to being 'more concentrated, but less often'. I would say that we'd get more practice, but we also try to rotate release managers, both to spread the knowledge and to avoid burnout. (Though Martin has been doing the last 4-or-so releases...)<br /><br />So here's to everyone upgrading to their preferred release (in about a week's time).jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com2tag:blogger.com,1999:blog-4423175964608972068.post-38866805112904634002009-10-08T09:02:00.008-06:002009-10-08T10:11:47.199-06:00Refactoring work for review (and keep your annotations)Tim Penhey recently had a nice <a href="http://how-bazaar.blogspot.com/2009/07/breaking-up-work-for-reivew.html">post</a> about how he split up his changes to make it easier to review.
His method used 'bzr pipeline' and some combinations of shelving, merging, and reverting the merges.<br /><br />However, while I wanted to refactor my changes to make them easier to review, I didn't want to lose my annotation history. So I took a different approach.<br /><br />To start with, I'll assume you have a single branch with lots of changes, each for a different feature. You developed them 'concurrently' (jumping back and forth between features, without actually committing them to different branches). And now that you are done, you want to split them out again.<br /><br />There are a lot of possible ways that you can do this, with some proponents preferring a 'rebase' style, where you replay the commits you made in a new order, possibly squashing them, etc. I'm personally not a big fan of that.<br /><a href="http://how-bazaar.blogspot.com/2009/07/breaking-up-work-for-reivew.html">Tim's</a> method is another, where you just cherrypick the changes into new branches, and use something like bzr-pipeline to manage the layering. However, in reading his workflow, he would also lose the history of the individual changes.<br /><br />So this is my workflow.<br /><ol><li>Start with a branch that has a whole lot of changes on it, and is essentially 'done'. We'll call this branch "dogpile".</li><li>Create a new branch from it (<span style="font-family:courier new;">bzr branch --switch ../dogpile ../feature1</span>), and remove all of the changes but the 'first step'. I personally did that with "<span style="font-family:courier new;">bzr revert -r submit: file1 file2 file3</span>" but left "file4" alone.</li><li>"<span style="font-family:courier new;">bzr commit</span>" in that branch. The delta for that revision will show a lot of your hard-worked-on changes being removed.
However "<span style="font-family:courier new;">bzr diff -r submit:</span>" should show a very nice clean patch that only includes the changes for "feature1".<br /></li><li>Go back to the original dogpile branch, and create a new "feature2" branch. (<span style="font-family:courier new;">bzr branch --switch ../dogpile ../feature2</span>)</li><li>Now merge the "feature1" branch (bzr merge ../feature1). At this point, it looks like everything has been removed except for the bits for feature1. However, just using "bzr revert file2..." we can restore the changes for "feature2".</li><li>You can track your progress in a few ways. "bzr diff -r submit:" will show you the combine differences from feature1 and feature2. "bzr diff -r -1:../feature1" will show you just the differences between the current feature2 branch and the feature1 branch. The latter is what you want to be cleaning up, so that it includes all of your feature2 changes, built on top of your feature1 changes. You also have the opportunity to tweak the code a bit, and run the test suite to make sure things are working correctly.<br /></li><li>"bzr commit" in that branch. At this point, the diff from upstream to feature1 should be clean, and the diff from feature1 => feature2 should be clean. As an added benefit, doing "bzr annotate file2" will preserve all the hard-won history of the file.</li><li>repeat steps 4-7 for all the other features you wanted to split out into their own branches.</li></ol>When you are done, you will have N feature branches, split up from the original "dogpile" branch. By using the "merge + revert things back into existence" trick, you can preserve all of the annotations for your files. This works because you have 2 sources that the file content could come from. One source is the "dogpile" branch, and the other source is a branch where "dogpile" changes were removed. 
Since the changes are present in one of the parents, the annotations are brought from there.<br /><br />This is what the 'qlog' of my refactoring looks like.<br /><a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEho5h1kd7M_5333_9O19qY_MeZJxS2tHILh8OcY00lbTCSES4ldGwDdXFiP_yZBbEMRo9gLmgKSWIix3EhehDMrJdRRCYdIrmrfulpRAzEcktkaLnQMIeKr6xGkizOQaF92oQ8IbTyu7kwS/s1600-h/static-tuple-refactoring.png"><img style="margin: 0px auto 10px; display: block; text-align: center; cursor: pointer; width: 400px; height: 109px;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEho5h1kd7M_5333_9O19qY_MeZJxS2tHILh8OcY00lbTCSES4ldGwDdXFiP_yZBbEMRo9gLmgKSWIix3EhehDMrJdRRCYdIrmrfulpRAzEcktkaLnQMIeKr6xGkizOQaF92oQ8IbTyu7kwS/s400/static-tuple-refactoring.png" alt="" id="BLOGGER_PHOTO_ID_5390261527821036978" border="0" /></a><br />The actual content changes (the little grey dots) actually span about 83 commits. However, you can see that I split that up into 6 new branches (some more independent than others), all of which generate a neat difference to their parent, and preserve all of the annotation information from the full history. You can also see that now that I have it split out, I can do simple changes to each branch (notice that purple has an extra commit). This will most likely come into play if people ask for any changes during review.jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com3tag:blogger.com,1999:blog-4423175964608972068.post-54337231168379360462009-03-23T08:25:00.015-06:002009-03-24T09:48:19.663-06:00brisbane-core<a href="http://code.mumak.net/2009/03/brisbane-core.html">Jonathan Lange</a> decided to drop some hints about what is going on in Bazaar, and I figured I could give a bit more detail about what is going on. 
"Brisbane-core" is the code name we have for our next generation repository format, since we started working on it in our November sprint in Brisbane last year.<br /><br />I'd like to start by saying we are really excited about how things are shaping up. We've been doing focused work on it for at least 6 months now. Some of the details are up on our <a href="http://bazaar-vcs.org/Roadmap/BrisbaneCore">wiki</a> for those who want to see how it is progressing.<br /><br />To give the "big picture" overview, there are 2 primary changes in the new repository layout.<br /><ol><li>Changing how the inventory is serialized. (makes log -v 20x faster)</li><li>Changing how data is compressed. (means the repository becomes 2.5:1 smaller, bzr.dev now fits in 25MB down from 100MB, MySQL fits in 170MB down from 500MB)</li></ol>The effect of these changes is both much less disk space used (which also affects number of bytes transmitted for network operations), and faster delta operations (so things like 'log -v' are now O(logN) rather than O(N), or 20x faster on medium sized trees, probably much faster on large trees).<br /><br /><span style="font-size:130%;"><br /><span style="font-weight: bold;">Inventory Serialization</span></span><br /><br />The inventory is our meta-information about what files are versioned and what state each file is at, (git calls it a 'tree', mercurial calls it the 'changelog'). Before brisbane-core, we treated the inventory as one large (xml) document, and we used the same delta algorithm as user files to shrink it when writing it to the repository. This works ok, but for large repositories, it is effectively a 2-4MB file that changes on every commit. The delta size is small, but the uncompressed size is very large. So to make it store efficiently, you need to store a lot of deltas rather than fulltexts, which causes your delta chain to increase, and makes extracting a given inventory slower. 
(Under certain pathological conditions, the inventory can actually take up more than 50% of the storage in the repository.)<br /><br />Just as important as disk consumption, is that when you go to compare two inventories, we would then have to deserialize two large documents into objects, and then compare all of the objects to see what has and has not changed. You can do this in sorted order, so it is O(N) rather than O(N^2) for a general diff, but it still means looking at every item in a tree, so even small changes take a while to compute. Also, just getting a little bit of data out of the tree, meant reading a large file.<br /><br />So with brisbane-core, we changed the inventory layer a bit. We now store it as a radix tree, mapping between file-id and the actual value for the entry. There were a few possible designs, but we went with this, because we knew we could keep the tree well balanced, even if users decide to do strange things with how they version files. (git, for example, uses directory based splitting. However if you have many files in one dir, then changing one record rewrites entries for all neighbors, or if you have a very deep directory structure, changing something deep has to rewrite all pages up to the root.) This has a few implications.<br /><br />1) When writing a new inventory, most of the "pages" get to be shared with other inventories that are similar. So while conceptually all information for a given revision is still 4MB, we now share 3.9MB with other revisions. (Conceptually, the total uncompressed data size is now closer to proportional to the total changes, rather than tree size * num revisions.)<br /><br />2) When comparing two inventories, you can now safely ignore all of those pages that you know are identical. 
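As a toy illustration of that comparison idea (invented for this post, not bzr's actual chk-map code), inventories can be modelled as dicts split into content-addressed pages, so that pages with identical keys are skipped entirely; the page size and file-id names here are made up:

```python
import hashlib

def split_into_pages(inventory, page_size=2):
    """Toy splitter: bucket sorted (file-id, entry) pairs into fixed-size
    pages keyed by a hash of their content. The real code uses a balanced
    radix tree, but the sharing property is the same: identical pages get
    identical keys, so they are stored (and compared) only once."""
    items = sorted(inventory.items())
    pages = {}
    for i in range(0, len(items), page_size):
        chunk = dict(items[i:i + page_size])
        key = hashlib.sha1(repr(sorted(chunk.items())).encode()).hexdigest()
        pages[key] = chunk
    return pages

def changed_entries(old_pages, new_pages):
    """Examine only pages whose keys are not shared with the old inventory."""
    changed = {}
    for key, chunk in new_pages.items():
        if key not in old_pages:
            changed.update(chunk)
    return changed

inv1 = {'file-%d' % i: 'text-sha-%d' % i for i in range(8)}
inv2 = dict(inv1, **{'file-3': 'text-sha-3-modified'})
old, new = split_into_pages(inv1), split_into_pages(inv2)
# Only the one page containing file-3 differs; the other 3 pages are shared.
delta = changed_entries(old, new)
assert 'file-3' in delta and len(delta) == 2
assert len(set(old) & set(new)) == 3
```

The work done is proportional to the number of differing pages, not to the size of the two trees.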
So for two similar revisions, you can find the logical difference between them by looking at data proportional to the difference, rather than the total size of both trees.<br /><br /><br /><span style="font-weight: bold;font-size:130%;" >Data Compression</span><br /><br />At the same time that we were updating the inventory logic, we also wanted to improve our storage efficiency. Right now, we store a line-based delta to the previous text. This works ok, but there are several places where it is inefficient.<br /><ol><li>To get the most recent text, you have to apply all of the deltas so far. Arguably the recent text is more often accessed than an old text, but it is the slower text to get. To offset this, you can cap the maximum number of deltas before you insert a fulltext. But that also affects your storage efficiency.</li><li>Merges are a common issue, as one side of the merge will have N deltas representing the changes made on that side. When you then merge, you end up with yet another copy of those texts. Imagine two branches, each changing 10 lines: when you merge them, a delta can only point at one parent, so you get a copy of that side's 10 lines, but the other side's 10 lines look like a new insert. Thought of another way, after a merge you have many lines that have existed in other revisions, but never in the same combination. Comparing against any single text would always be inefficient.</li><li>Cross-file compression. Similar to the 'single parent' issue in (2), there are also times when you have texts that don't share a common ancestry, but actually have a lot of lines in common (like all of the copyright headers).<br /></li></ol>There are lots of potential solutions for these, but the one we went with is what we are calling "groupcompress". The basic idea is that you build up a "group" of texts that you compress together. We start by inserting a fulltext for the most recent version of a file.
We then start adding ancestors of the file to the group, generating a delta for the changes. The lines of the delta are then used as part of the source when computing the next delta. Once enough texts have been added to a group, we then pass the whole thing through another compressor (currently zlib, though we are evaluating lzma as a "slower but smaller" alternative).<br /><br />As an example, say you have 3 texts.<br /><pre>text1:<br />first line<br />second line<br />third line<br /><br />text2:<br />first line<br />modified second line<br />third line<br /><br />text3:<br />first line<br />remodified second line<br />third line<br /></pre><br />So text3 is inserted at the start as a fulltext. When you insert text2, you end up with a delta that copies "first line" and "third line" from text3, and inserts "modified second line" into the stream. When you get to text1, you can again copy the bytes for "first line" and "third line" from text3, and insert "second line".<br /><br />There are a few ways to look at this. For example, one can consider that the recipe for extracting text1 is approximately the same as if you used a simple delta for text3 => text2, and then another delta for text2 => text1. The primary difference is that the recipe has already combined the two deltas together. The main benefit is that to extract text1, you don't have to create the intermediate text2.<br /><br />One downside to storing the expanded recipes is that there is some redundancy. Consider that both text2 and text1 will be copying "first line" from text3. In short examples, this isn't a big deal, but if you have 100s of texts in a row, the final recipe will look very similar to the previous one, and they will be copy instructions from a lot of different regions. (Development tends to add lines, so storing things in reverse order means those lines look like deletions. Removing lines splits a copy command into 2 copy commands for the lines before and the lines after.)<br /><br />A lot of that redundancy is removed by the zlib pass.
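The group-and-recipe idea above can be sketched in a few lines of Python (a toy invented for this post; real groupcompress works on byte ranges, not whole lines):

```python
import zlib

def build_group(texts):
    """texts are newest-first. Returns (recipes, pool, compressed_pool).

    A recipe is a list of ('copy', pool_index) or ('insert', line) ops.
    The first text becomes all inserts (a fulltext); later texts copy any
    line already in the pool, so shared lines are stored only once."""
    pool = []       # every distinct line emitted into the group so far
    recipes = []
    for text in texts:
        ops = []
        for line in text.splitlines(True):
            if line in pool:
                ops.append(('copy', pool.index(line)))
            else:
                ops.append(('insert', line))
                pool.append(line)
        recipes.append(ops)
    return recipes, pool, zlib.compress(''.join(pool).encode())

def extract(recipes, pool, index):
    """Rebuild one text directly from its recipe; no intermediate texts."""
    return ''.join(pool[i] if op == 'copy' else i
                   for op, i in recipes[index])

text3 = 'first line\nremodified second line\nthird line\n'
text2 = 'first line\nmodified second line\nthird line\n'
text1 = 'first line\nsecond line\nthird line\n'
recipes, pool, blob = build_group([text3, text2, text1])
assert extract(recipes, pool, 2) == text1
```

Extracting any text touches only the shared pool, never a reconstructed intermediate text, and the pool as a whole is what gets the zlib pass.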
By doing the delta compression first, you can still get good efficiency from zlib's 32kB window. The other thing we do is analyze the complexity of the recipe. If the recipe starts becoming too involved, we will go ahead and insert a new fulltext, which then becomes a source for all the other texts that follow. There are lots of bits that we can tune here. The most important part right now is making sure that the storage is flexible to allow us to change the compressor in the future, without breaking old clients.<br /><br /><br /><span style="font-weight: bold;"><span style="font-size:130%;">What now?<br /></span></span><br />The branch where the work is being integrated is available from:<br /><span style="font-family:courier new;">bzr branch <a href="https://code.launchpad.net/%7Ebzr/bzr/brisbane-core">lp:~bzr/bzr/brisbane-core</a><br />cd brisbane-core<br />make<br /><br /></span>For now, the new repository format is available as "<span style="font-family:courier new;">bzr init-repo --format=gc-chk255-big</span>", but is considered "alpha". Meaning it passes tests, but we reserve the right to change the disk format at will. Our goal is to get the format to "beta" within a month or so. At which point it will land in bzr.dev (and thus the next release) as a "--development" format (also available in the <a href="https://edge.launchpad.net/%7Ebzr-nightly-ppa/+archive/ppa">nightly ppa</a>). At that point, we won't guarantee that the format will be supported a year from now, but we guarantee to allow support converting data out of that format. (So if we release --development5, there will be an upgrade path to --development6 if we need to bump the disk format.)<br /><br />Going further, we are expecting to make it our default format by around June, 2009.<br /><br />At this point changes are mostly polish and ensuring that all standard use cases do not regress in performance. 
(Operations that are O(N) are often slower in a split layout, because you spend time bringing in each page. But if you can change them into O(logN) the higher constants don't matter anymore.)jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com3tag:blogger.com,1999:blog-4423175964608972068.post-26736872421383810782008-08-14T14:27:00.003-06:002008-08-14T14:30:31.501-06:00This Week in BazaarAh, to take a break from reporting to the world, but now we are back. This used to be a weekly series of posts about the ongoing events in the world of Bazaar (and may be yet again), written by co-authors John Arbash Meinel, one of the primary developers on Bazaar, and Paul Hummer, who works on integrating Bazaar into Launchpad.<br /><br /><br /><span style="font-weight: bold;font-size:180%;" >Bazaar 1.6rc3 Released<br /></span><br />With Martin Pool going on vacation for the next two weeks, John has stepped up to marshal 1.6 out the door. And he started with not 1 but 2 release candidates in 2 days. We're trying hard to get back into a time-based release schedule. The problem with sneaking in a feature-based release is that they always end up slipping, as everyone tries to get "one-more-thing" into the delayed release. However, with RC3, we've actually gotten the list of things that must be in 1.6 down to 0, so there is a very good chance it will become 1.6-final next week.<br /><br />Since it has been a delayed release, there are lots of goodies inside to partake of. Stacked Branches, improved Weave merge, significantly faster '<span style="font-family: courier new;">bzr log --short</span>', improvements to the Windows installation, better server-side hooks, and the list goes on.
Most of this we have mentioned in previous "This Weeks"; the big difference is that it is available in a release, rather than just in the bzr.dev trunk.<br /><br />The Windows install is one of the major changes, in that it will now (by default) bundle TortoiseBzr as part of the standalone install. TortoiseBzr still needs work before it is as much of a joy to work with as the rest of the system, but this release is mostly about testing our ability to bundle them together.<br /><br /><br /><span style="font-weight: bold;font-size:180%;" >Looking forward to Bazaar 1.7<br /></span><br />As 1.6 nears its official release, the development community has started planning the 1.7 development process. As it stands now, bzr 1.7 has a planned release date of September 8th. This means there are two whole weeks to get various bugfixes and contributions to Bazaar in before getting down to release time (mentoring available).<br /><br />Among the proposed potential features, there are a few that really stand out. Mark Hammond has been polishing Bazaar on Windows, and there is much desire for someone to help get the Bazaar test suite to run cleanly on Mac OS X. These features will greatly add to the existing portability strengths of Bazaar. While the majority of changes needed are actually in the test suite, and not the core functionality, the community could really use someone who could step up and learn how to do unit testing in Python. Bazaar 1.7 will also see some increased merge flexibility, especially with criss-cross merges.<br /><br />Improvements to the indexing layer are likely to land in 1.7, though as always, not on the default format. (We want at least 1 release supporting a format before we suggest it as the default, to give people time for compatibility.)
The new b+tree layout for indexes makes them smaller (by approx 2:1) and makes them faster to search (e.g., bzr log file being 3x faster).<br /><br />We also have a chance to land Group Compress, which has been shown to compress repositories by as much as 3:1 over their current size. This change needs a bit more tweaking, though. There are generally tradeoffs between how much time you spend compressing and how small the result is, and we want to make sure that we make the right tradeoffs. It is currently being evaluated as a test plugin.<br /><br /><br /><span style="font-weight: bold;font-size:180%;" >Bazaar Bug Day<br /></span><br />As Bazaar development speeds up, so do the incoming bugs. There are currently 1062 open bugs in Launchpad, and 287 of them have a "New" status, meaning they have not yet been triaged and categorized. At a past Bazaar sprint, a "bug day" was talked about, and it has been brought up again on the mailing list. Often, we fix many bugs and just haven't gotten around to marking them fixed. This is a great opportunity for members of the community who use Bazaar but don't directly develop it to contribute back to the Bazaar community. You can help out by verifying that bugs marked fixed really are fixed, or by confirming that open bugs still exist and providing more information on them. Come give your karma a boost, and help us squish some bugs!jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com3tag:blogger.com,1999:blog-4423175964608972068.post-89177683950511010192008-07-28T14:33:00.006-06:002008-07-30T10:08:44.108-06:00Last Week in BazaarWell, I'm late this week, so I'm officially marking this post as Last Week in Bazaar. In my defense, I got busy last Thursday, and then my cohort (Paul Hummer) flew off to New Zealand for a work-related sprint.
So today, I (John Arbash Meinel, a developer on Bazaar) get to exercise full control over the content.<br /><br /><br /><span style="font-weight: bold;font-size:180%;" >Keyword Expansion<br /></span><br />People often request the ability to expand keywords, like they are used to in SVN and CVS. We've sort of postponed the implementation, because probably 90% of the time, it isn't really the right solution to the problem users are having. Also, they are kind of a mess in CVS anyway. Where I used to work we tried to use <span style="font-family:courier new;">$Id$</span> style expansion, only to find out that they conflict on every attempt at merging, and we started working hard to strip them out of our files. In a distributed VCS, you usually merge at least an order-of-magnitude more often, which also tends to reveal this problem.<br /><br />SVN at least works around the problem, in that when you commit, it actually strips the texts of their expanded keywords, so that the repository never stores the expanded form. And merges are also done on the non-expanded form. Which fixes that little problem. Though it introduces a couple others. Specifically, what you have on disk is not what sits in the repository, nor is it exactly what you will get out of a fresh checkout. The biggest reason is that if you commit revno 1354, it will update the tags of files that are touched. But if you checkout revno 1354 it will update the tags of *all* files. (I'm not positive on this, but I know there was a <a href="http://subversion.tigris.org/issues/show_bug.cgi?id=1743">bug</a> which was causing problems for people trying to do conversions. Because they couldn't quite find the right invocation to have 'update -r 1354' (from 1353) give the exact same tree as 'checkout -r 1354').<br /><br />The other reason keyword expansion is not usually what you want, is because it expands only for the given file. 
If you make a commit to 5 other files, the *tree* is at revno 1359, but the file with your:<br /><br /><span style="font-family:courier new;">my_version = "$Version$";</span><br /><br />tag is still pointing at 1354. (Again, if 'svn update' would force all the tags to get re-expanded it might work correctly, though you run into performance problems expanding every keyword in every file on every update.) Bazaar has supported the <span style="font-family:courier new;">bzr version-info</span> command for a while, which lets you generate a version file (possibly from a template) which can store all the real details, including the last-modified version for every file, whether any files in the working tree have been modified since commit, etc.<br /><br />The only case where I've really heard a good reason for keyword expansion is for a website, where each individual file is spread out into the world, so having a little "last modified" at the bottom can be pretty convenient. You also don't tend to have a "build" process which lets you generate the version information at that time.<br /><br />However, as Bazaar is meant to be a flexible system, Ian Clatworthy has done a wonderful job of adding the ability to support content munging via plugins, and has continued on to write a plugin specifically for expanding keywords.<br /><br /><a href="http://bazaar-vcs.org/KeywordExpansion">http://bazaar-vcs.org/KeywordExpansion</a><br /><br />So for all those people who feel they really need keyword expansion, look it up. I would imagine that once people get a good feel for it, and it matures a bit, it has a good chance to be brought into core code, or at least make it into the release tarball as an "official" plugin.<br /><br /><br /><span style="font-weight: bold;font-size:130%;" >Open Source, Python, and Counting My Blessings<br /></span><br />Now onto something a bit more personal.
This last week I had cause to revisit an old library I had written, and try to get it up and running again. (Specifically, the project was <a href="https://launchpad.net/pydcmtk">pydcmtk</a>, Python wrappers for the <a href="http://dicom.offis.de/dcmtk.php.en">Dicom Toolkit</a>.)<br />It took me several hours times several days just to get it to build and run the test suite again, all without changing any of the code. It was simply a build-step problem.<br />The experience revealed a couple of wonderful things about my current work:<br /><br /><ol><li>I get to work in Python, which is a nice, flexible language that *doesn't* need a build step. I don't have to deal with C/C++ and all the complexities of getting dependencies built with the right version of the compiler and the right compiler flags.</li><li>Microsoft has a much harder time on their hands than Open Source does, at least when it comes to compatibility. Specifically, each version of their compiler comes with a different runtime. Code compiled for Visual Studio 7.1 doesn't like to work with the 8.0 objects or the 9.0 objects, and they all have different msvcrtX.X.dll files. However, because the official method for getting your program to users is in binary (object) form, they have to provide ways to support your binary files for a long time. So in VS 8.0 they introduced a new step, which is to post-process your linked binaries with a manifest declaring what runtimes they use. Further complicating this is that if you try to run an 8.0-compiled dll without that manifest, it just gives an opaque "This process has tried to access the runtime incorrectly."<br />Not realizing this, I spent a long time comparing the exact compiler flags with other examples, trying to fix it.
(The boost build tool, bjam, knows how to do it, but there was a line "if exists foo.manifest: do stuff", which I originally read as "if not exists foo.manifest: create the manifest.")</li><li>Open source has generally handled the binary compatibility issue by punting, and requesting source compatibility instead. A whole bunch of groups then spend their time recompiling everything for you (distributions like Ubuntu or Red Hat), and they give you all the dependencies with a few simple commands (apt-get install zlib-dev dcmtk-dev boost-dev). On Windows, if you want to switch to developer mode, you generally have to grab the source code for all of those dependencies, and recompile them for your exact configuration.<br />Source-level compatibility is *much* easier to handle, not least because if something becomes incompatible you can fix it. (I remember a Microsoft memory <a href="http://www.joelonsoftware.com/articles/fog0000000054.html">issue</a>, where they had to switch in bug-for-bug compatibility because fixing it broke SimCity; how much better if they could have just patched SimCity.)<br />Binary compatibility (for C/C++) means that you can't even add members to structs, because then the size changes and malloc starts failing (plus members are referenced by offset, so adding something in the *middle* is a big no-no).<br />Source-level compatibility turns this way down, to just not removing things people are using. And if something does change, with source-level compatibility you can even write a patch to fix the code. This does make it quite a bit harder for people who want to release binary-only packages that they then don't have to modify for years. (Though when updating is a simple process, people are willing to do it more often.)</li></ol><br /><span style="font-weight: bold;font-size:180%;" >1.6rc1 soon to come<br /></span><br />We are working on putting the final polish on stacked branches.
We are trying to release something that people can feel comfortable using right away, and there are a few tricks to get there. (For example, bzr has a general policy of always preserving the source format when you do 'bzr branch'. It helps maintain compatibility within a project that hasn't chosen to upgrade to a newer format yet. However, if you do 'bzr branch --stacked', that indicates you want to use the new feature, so we have to work out logic to create an upgraded target at the right time. This also turns out to conflict a bit with bzr-svn, which had its own logic to trick 'bzr branch' into not copying the source format.)<br /><br />You can already play with the Stacked Branches feature in the beta releases, but it'll appear much more polished in the final rc.jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com0tag:blogger.com,1999:blog-4423175964608972068.post-32581830299730017862008-07-17T13:42:00.008-06:002008-07-17T16:44:17.128-06:00This Week in BazaarWelcome back to the terrarium of the Bazaar distributed version control system. Written by co-authors John Arbash Meinel, one of the primary developers on Bazaar, and Paul Hummer, who works on integrating Bazaar into Launchpad as he refines his plans for world domination from his shiny new lair.<br /><br /><span style="font-weight: bold;font-size:180%;" >Bazaar 1.6b3 released<br /></span><br />The next beta release of Bazaar has just been cut, and is available at your local PPA:<br /><a href="https://launchpad.net/%7Ebzr/+archive">https://launchpad.net/~bzr/+archive<br /></a><br />The Windows installers should be available later today.
This release provides lots of the shiny things that we've been talking about, like <a href="http://jam-bazaar.blogspot.com/2008/05/this-week-in-bazaar_29.html">Stacked Branches</a>, <a href="http://jam-bazaar.blogspot.com/2008/07/this-week-in-bazaar_10.html">Real Weave Merge</a>, more hooks for server-side operation, and lots of bug fixes and general polishing. The full UI for using stacked branches still needs a little bit of polishing, so the feature is not enabled by default. The functionality is all there, though, and if you are interested, we'd love to hear from you (kudos and complaints are equally welcome).<br /><br /><br /><span style="font-weight: bold;font-size:180%;" >New updates to Gnome Bazaar Playground<br /></span><br />Coming back from a very productive trip to <a href="http://guadec.expectnation.com/public/content/main">Guadec</a>, Tim Penhey has been overseeing some customizations to the Bazaar Playground for Gnome. All of the branches created at the local server in Turkey for Guadec have been added to the public playground. The <a href="https://launchpad.net/loggerhead">Loggerhead</a> installation has received some TLC by way of customizations to the UI. <a href="http://bzr-playground.gnome.org/accerciser/">Accerciser's playground page</a> is a good demonstration of the UI changes that have been made. The playground is actively being used by applications such as <a href="http://bzr-playground.gnome.org/brasero/">Brasero</a>, <a href="http://bzr-playground.gnome.org/jhbuild/">jhbuild</a>, <a href="http://bzr-playground.gnome.org/metacity/">Metacity</a> and more.<br /><br />One of the fun results of meeting with people at Guadec is that it showed ways to improve Loggerhead when dealing with lots of projects and lots of branches. Work is continuing to make customizing Loggerhead's look-and-feel easier, and providing better tools for creating these "Bazaar Playgrounds" to use in evaluating Bazaar.
The Bazaar developers are committed to making these tools easier to use, and to making the process as simple and powerful as possible.<br /><br /><span style="font-weight: bold;font-size:180%;" >Up and Coming Repository Format Updates<br /></span><br /><a href="http://www.advogato.org/person/robertc/diary.html">Robert Collins</a> has been hard at work refining how Bazaar stores its history information. We all like to have deep context, but we don't like to have to pay the penalty of downloading all of that context. Because Bazaar has a flexible repository structure, Robert has been able to play with changing the on-disk structure without major surgery to the rest of the code.<br /><br />First is a change to how <a href="https://code.launchpad.net/%7Elifeless/+junk/bzr-index2">indexes</a> are written, switching from a bisectable list to a btree structure. This paged structure allows us to compress the indexes, making them smaller and faster to process remotely. It also reduces the number of lookups needed to find a key. (On average, a bisect search takes log<sub>2</sub>N lookups, while the btree is closer to log<sub>100</sub>N.) At the moment, he is testing this with a shared repository containing all of the projects available in the Ubuntu apt repositories. This weighs in at around 13k branches, and somewhere around 20GB of disk space used.<br /><br />Second is an update to how texts are stored. At the moment we use a simple format which places fulltexts periodically, and then stores deltas against those fulltexts. It has served us rather well, but can be improved upon. With his <a href="https://code.launchpad.net/%7Elifeless/+junk/bzr-groupcompress">Group compress</a> work, we can see savings of as much as 2x-3x. Further, the data is stored such that you can do simple linear reads to get the base fulltext and all deltas necessary to generate a given fulltext. This reduces the pressure on indices, as you don't have to search for base texts.
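As an aside on those index numbers: the claimed difference is easy to sanity-check with a little arithmetic. This is just back-of-the-envelope Python, where 13,000 reuses the branch count above purely as an example key count and 100 is a rough page fanout, not the real on-disk layout:

```python
import math

def node_reads(n_keys, fanout):
    """Approximate probes needed to find one key in a balanced
    search structure with the given branching factor."""
    return max(1, math.ceil(math.log(n_keys, fanout)))

print(node_reads(13000, 2))    # bisect over a flat sorted list: 14 probes
print(node_reads(13000, 100))  # btree with ~100-entry pages: 3 page reads
```

Over a dumb transport each probe can mean another round trip, which is why going from roughly fourteen reads down to three matters so much for remote access.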
(Instead you just store a pointer to the start, and give the total length that needs to be read.)<br /><br />These are still in the development phase, but a format that uses them will likely appear in the next release (bzr 1.7).<br /><br /><span style="font-weight: bold;font-size:180%;" >Community Agile<br /></span><br />Ian Clatworthy has recently released a wonderful <a href="http://ianclatworthy.wordpress.com/2008/07/09/announcing-the-community-agile-project/">document</a> describing the workflow we (generally) use at Canonical. It describes how its basic practices are similar to, and different from, other systems like <a href="http://en.wikipedia.org/wiki/Agile_software_development">Agile</a>. The biggest difference (IMO) is the recognition that the community surrounding your project is one of the strongest and most important pieces. This has always been true in software development, but it has traditionally been somewhat hidden. Open Source has exposed just how powerful the community can be. For people interested in <span style="font-style: italic;">how</span> software can be developed, rather than just <span style="font-style: italic;">what</span>, I certainly recommend it.jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com0tag:blogger.com,1999:blog-4423175964608972068.post-60253100368236249012008-07-10T14:05:00.005-06:002008-07-10T14:10:45.572-06:00This Week in BazaarHere we are again, bringing you the gossip and dirty secrets in the development world of the Bazaar distributed version control system.
In this, the 10th week, the series is now under new management, with co-authors John Arbash Meinel, one of the primary developers on Bazaar, and Paul Hummer, who works on integrating Bazaar into Launchpad.<br /><br /><br /><span style="font-size:180%;"><span style="font-weight: bold;">Bundle Buggy</span></span><br /><br /><a href="http://code.aaronbentley.com/">Aaron Bentley</a> has once again been improving his wonderful <a href="http://bundlebuggy.aaronbentley.com/">Bundle Buggy</a>. He just introduced support for multiple projects using a single instance of Bundle Buggy. There are now 5 Bazaar projects using the main Bundle Buggy instance. (Bazaar, bzr-gtk, Bundle Buggy itself, Bzrtools, and PQM.) Of course, Daniel Watkins has made excellent use of his time, and has managed to crank out lots of updates for PQM. At this point it is code cleanup: reducing the dependencies and making it easier to set up and install.<br /><span style="font-weight: bold;font-size:180%;" ><br />Bazaar playground for Gnome</span><br /><br />Originally, John Carr set up Bazaar mirrors of all the Gnome modules, which people could then use as a starting point for publishing code and collaborating. This week, the <a href="http://bzr-playground.gnome.org/">Bazaar playground for Gnome</a> was created so that any Gnome developer could get involved in pushing, branching, and sharing code through Bazaar. This new server runs Loggerhead for viewing the code committed to these Bazaar branches. <a href="http://bzr-playground.gnome.org/damned-lies/">Damned Lies</a> is also set up on the playground.
This server was also reproduced locally at GUADEC because of the flaky internet connection at the conference, and all those local branches will be moved to the playground shortly.<br /><br /><br /><span style="font-weight: bold;font-size:180%;" >Weave merging and handling "interesting" history<br /></span><br />One of the great things about having a large project like MySQL using your software is that they push and stretch you in ways that you haven't necessarily encountered before. Specifically, their branch workflow looks a bit like a pile of spaghetti, with several long-term maintenance branches, team branches based off of those, and individual developer branches based off of those. Patches have a tendency to travel in unexpected ways (you may go user => team => release 1 => release 2, or you might go release 1 => team => team-2 => release 2, etc). They also are very fond of 'null merging' patches that aren't relevant to the next release: they merge the change, revert the text changes, and commit.<br /><br />Bazaar supports all of this, but it exposes <a href="http://revctrl.org/CrissCrossMerge">weaknesses</a> in simple 3-way merge logic. Because patches don't flow in anything considered orderly, you don't often have the opportunity to select a "clean" base. Bazaar has long had an option for doing a <a href="http://bazaar-vcs.org/BzrWeaveFormat">"--weave" merge</a>. It didn't receive much attention for a while, and had become rather slow. It turned out to be a good fit for MySQL's workflow, though, so John has spent a bit of time recently making the functionality efficient and correct in some specific edge cases.
Expect the improvements to show up in the next release.jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com1tag:blogger.com,1999:blog-4423175964608972068.post-10315080505755586522008-07-03T14:30:00.003-06:002008-07-03T14:53:21.187-06:00This Week in BazaarThis is the 9th in a series of posts about current topics in the development world of the Bazaar distributed version control system. The series is co-authored by John Arbash Meinel, one of the primary developers on Bazaar, and Elliot Murphy, unlicensed health professional. This week we are joined by Paul Hummer, who works on integrating Bazaar into Launchpad.<br /><br /><span style="font-weight: bold;font-size:180%;" >How to integrate bzr into your build and release process</span><br /><br />Once you are happily using bzr on your project, the next step is some basic integration into your build process. A common desire is getting the revision number to store during the build process, so that you can tell what revision your program was built with. This is easy to do with '<span style="font-family:courier new;">bzr revno</span>', which prints the current revision number. That's not very exciting, though.<br /><br />There is a much more sophisticated command in bzr called version-info. For example, running:<br /><pre> bzr version-info --custom \<br /> --template="#define VERSION_INFO \"Project 1.2.3 (r{revno})\"\n"</pre><br />will produce a C header file with a formatted string containing the current revision number. Other supported variables in the templates are: date, build date, revno, revision id, branch nickname, and clean (which shows whether the tree contained uncommitted changes). This makes integrating into make or another build system very easy, and the templates make it easy to generate a version file for whatever language you are writing in.<br /><br />What else could be automated other than version info? The <a href="https://launchpad.net/bzr-stats">bzr-stats</a> plugin has a credits command.
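Stepping back to the version-info template for a moment, the substitution itself is simple enough to model in a couple of lines of Python. This is a hypothetical sketch for illustration only; `expand` is not a real bzr API, and bzr's own variable handling is more involved:

```python
# Hypothetical sketch of version-info style template substitution.
# The {revno} placeholder mirrors the example template above.
def expand(template, **values):
    return template.format(**values)

header = expand('#define VERSION_INFO "Project 1.2.3 (r{revno})"\n',
                revno=1354)
print(header, end='')
# → #define VERSION_INFO "Project 1.2.3 (r1354)"
```

In a real build you would, of course, just call bzr version-info from make rather than reimplementing it.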
The credits command is useful for getting a list of contributors to fill out a credits page, easter egg, etc. Also, changelogs can be generated with the <a href="http://telecom.inescporto.pt/%7Egjc/gnulog.py">gnulog plugin</a>.<br /><br />Andrew Bennetts has been working on a new server-side push hook that can be used to run tests before allowing a push to complete. Wow, this could replace <a href="https://launchpad.net/pqm">PQM</a>! Well, not quite. This is more of a poor man's PQM. It doesn't scale as well, but would work for smaller teams that don't necessarily need PQM. Blocking push while tests are running is not a good idea if you have a very long test suite, and PQM will merge and commit, making it easier to deal with multiple people trying to merge changes at the same time. If you're working in a very small group (1-3 people) with a smaller test suite, using these hooks might be just the trick, but for a larger work group you should still set up PQM.<br /><br />Right now PQM is a fair amount of work to set up, but that should be changing soon. Daniel Watkins has started work on making PQM easier to set up and use, and others have been submitting cleanup patches too.<br /><br />Finally, if you are using bzr on a project that builds .deb packages, check out the <a href="https://edge.launchpad.net/bzr-builddeb">builddeb</a> plugin. It would be great to have plugins for other packaging tools as well! RPM, MSI, JAR, WAR, etc.jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com0tag:blogger.com,1999:blog-4423175964608972068.post-24505280377430276392008-06-26T14:24:00.004-06:002008-06-26T16:33:17.219-06:00This Week in BazaarThis is the eighth (wow, 2 whole months of solid updates, yippee!) in a series of posts about current topics in the development world of the Bazaar distributed version control system. The series is co-authored by John Arbash Meinel, one of the primary developers on Bazaar, and Elliot Murphy, who drinks the rain.
This week we are joined by Martin Albisetti, talking about Loggerhead, and dreaming of a cold pint.<br /><br /><span style="font-size:180%;"><span style="font-weight: bold;">bzr-search, loggerhead, gnome, and you</span><br /></span><br />Robert Collins recently published his awesome <a href="http://www.advogato.org/person/robertc/diary/87.html">bzr-search</a> plugin, and John Carr has been doing a lot of work on setting up a <a href="http://blogs.gnome.org/johncarr/">bzr mirror of Gnome</a>. A neat search module and a bunch of source trees is just begging to be combined in some sort of web interface!<br /><br />There are a few web front ends for Bazaar at the moment, such as Loggerhead, webserve, viewbzr, and bzrweb. Today we are going to be focusing on <a href="http://www.lag.net/loggerhead">Loggerhead</a> (you can also go to its <a href="https://launchpad.net/loggerhead">Launchpad project page</a> to watch the development activity). It is probably the one with the most active development at the moment. An installation of the latest code in action is available at the <a href="http://bzr-mirror.gnome.org:8080/banshee/trunk/changes">bzr mirror of Gnome</a>. Loggerhead shows side-by-side diffs, has RSS feeds, and lets you download specific changes, just like you would expect.<br /><br />You can get the latest version of it yourself by doing:<br /> <span style="font-family:courier new;">bzr branch lp:loggerhead</span><br />You'll need python-simpletal and python-paste. Then by running "<span style="font-family:courier new;">serve-branches.py</span>" in the directory where your branches live, you should be up and running with your own web interface. Eventually <span style="font-family: courier new;">serve-branches.py</span> is expected to become a bzr plugin which will let you easily serve your branches with a single bzr command.<br /><br />We hinted at it above; recent versions have started integrating with bzr-search.
So for branches that you've run "<span style="font-family:courier new;">bzr index</span>" on, it can give hints in the search dialog, and quickly find revisions that match your search terms. You can try it yourself by just typing a few letters into the <a href="http://bzr-mirror.gnome.org:8080/banshee/trunk/changes">search dialog</a>.<br /><br />In the coming weeks, Loggerhead will be getting a bit of a face lift with a new theme to make its externals as shiny and new as its internals.<br /><br />So give it a poke, and send any feedback to either <a href="mailto:bazaar@lists.canonical.com">bazaar@lists.canonical.com</a>, or <a href="https://bugs.launchpad.net/loggerhead">https://bugs.launchpad.net/loggerhead</a>.jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com0tag:blogger.com,1999:blog-4423175964608972068.post-29659830595314215862008-06-19T13:58:00.004-06:002008-06-19T15:43:59.321-06:00This Week in BazaarThis is the seventh in a series of posts about current topics in the development world of the Bazaar distributed version control system. The series is co-authored by John Arbash Meinel, one of the primary developers on Bazaar, and Elliot Murphy, who is sentimental today.<br /><br /><span style="font-weight: bold;font-size:180%;" >MySQL Switches to Bazaar</span><br /><br />Very big news for the Bazaar team today, as <a href="http://blogs.mysql.com/kaj/2008/06/19/version-control-thanks-bitkeeper-welcome-bazaar/">MySQL announces switching from Bitkeeper to Bazaar</a>.<br /><br />One of the things that was important in doing this conversion was doing a very high quality import of all the existing history. John did a great job working on that, and even added a new feature to Bazaar and bzr-gtk to enable this: per-file commit messages. Since per-file commit messages had been used for years in the MySQL code base, it was not acceptable to lose them, and none of the DVCS systems under consideration supported these messages. 
Although this feature is debated by some, it was important to preserve that history, and so support for per-file commit messages was added to Bazaar in a non-invasive way, where projects that wanted to use them could, but existing projects were not forced to adopt them. At the moment, to enter per-file commit messages you need to use the <a href="http://bazaar-vcs.org/bzr-gtk">bzr-gtk</a> GUI commit tool, but we'd love it if someone came up with a clean way to enable this in the standard CLI also.<br /><br />It was also important to have a smooth transition period that did not interrupt delivering MySQL releases. This meant we needed a stable importer, where the imports could be periodically refreshed without causing all of the developers around the world working on the project to re-download all their trees. At one point we were doing continuous imports of over 30 trees.<br /><br />It's been a fun and challenging project providing support to MySQL during this time. Although we're really excited about this milestone, we still have plenty of work to do. Here are a few things we've learned, and where we are working to make Bazaar even better.<br /><br />Stacked Branches - We've talked <a href="http://jam-bazaar.blogspot.com/2008/05/this-week-in-bazaar_29.html">previously</a> about stacked branches, and for a project like MySQL this new feature will make uploading a new branch to Launchpad much faster.<br /><br />Merging - Bazaar has several good merge algorithms, but we still have some ideas to make merging go even smoother, particularly for some of the complicated ancestries that MySQL has. All merge algorithms have their own set of trade-offs: edge cases that they handle better or worse than other algorithms.<br /><br />We also need to continue to add GUI tools, and make further enhancements to existing tools.
If you are looking for a valuable way to contribute to Bazaar, try lending a hand to one of the Bazaar GUI projects.<br /><br />Last week we asked about bzr screencasts, and James Westby told me about a screencast that he recorded - if anyone else is interested in getting involved in producing a series of screencasts, please do let us know.jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com2tag:blogger.com,1999:blog-4423175964608972068.post-53840855001263746352008-06-11T13:58:00.004-06:002008-06-11T14:05:39.515-06:00This Week in BazaarThis is the sixth in a series of posts about current topics in the development world of the Bazaar distributed version control system. The series is co-authored by John Arbash Meinel, one of the primary developers on Bazaar, and Elliot Murphy, who just wants a nice story and a nap.<br /><br /><span style="font-weight: bold;font-size:130%;" >1.6 on the way<br /></span><br />We decided to change the release process a bit for the bzr 1.6 release. We're introducing a bit more than normal in this release (such as <a href="http://jam-bazaar.blogspot.com/2008/05/this-week-in-bazaar_29.html">Stacked Branches</a>), so we've decided to delay the final release a couple of weeks to ensure that everything gets an extra coat of polish. We've already had 2 beta releases, which are available in the <a href="https://launchpad.net/%7Ebzr-beta-ppa/+archive">beta ppa</a> for Ubuntu distributions. Or you can download the source.<br /><br />Please give it a poke and let us know what you think.<br /><br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Diff and Merge Tools</span></span><br /><br />When you start working with other people on a project, you need some way of seeing what code has changed, doing code reviews, resolving conflicts, etc.
The 'bzr diff' command has a '--using=foo' argument that allows you to plug in your favorite diff/merge tool if you don't want the built-in text based diff. You can also add an alias for your favorite tool. For example, Elliot uses meld all the time, so he has the alias 'mdiff=diff --using=meld'. You also might want to install the <a href="https://edge.launchpad.net/bzr-difftools">difftools</a> plugin, which adds some smarts to Bazaar about whether a particular tool understands how to diff a full tree or needs to handle the files one at a time. Here are some of the more interesting diff tools that you might want to try out:<br /><br /><ul><li><a href="http://meld.sourceforge.net/">Meld</a></li><li><a href="http://kdiff3.sourceforge.net/">Kdiff3</a></li><li><a href="http://www.vim.org/">vimdiff</a></li><li>Wikipedia <a href="http://en.wikipedia.org/wiki/Comparison_of_file_comparison_tools">lists many more file comparison tools</a></li></ul><br />One technique for easily reviewing a lot of incoming code is to keep around a pristine branch of your project that you use for conducting reviews. You can apply a patch to the tree, then run 'bzr mdiff' (or your own favorite tool), and take a look at all the changes in the patch with a lot more context than is included in the patch itself. This also gives you a spot to run the automated tests for that project, see if it compiles, etc. Once you are done with the review you can simply 'bzr revert' to get back to a clean tree and move on to the next patch to be reviewed.<br /><br />Another neat trick is to use the 'merge --preview' switch. You might want to use this command to take a look at any conflicts that might have been introduced if there have been changes since the patch was generated.
It shows you the patch of exactly what would be merged into the branch at that moment in time, which can sometimes have differences from what you would be reviewing by reading the patch.<br /><br />Another interesting (but commercial) tool is <a href="http://changesapp.com/">Changes.app</a>. It is a Mac OS X client which integrates with Finder and provides a comparison tool. It has direct support for Bazaar as well as several other version control systems.<br /><br /><br /><span style="font-weight: bold;font-size:130%;" >Screencasts</span><br /><br />Screencasts are becoming a very popular way to show people how to use your fancy tool, and we'd like to get some volunteers to help with putting together some screencasts explaining how to use various parts of bzr and related tools. If you want to help with this, email elliot at canonical dot com. The great thing about screencasts is that they use a different avenue for conveying information (audio, motion, etc) so while it won't replace a written tutorial, it is a wonderful supplement.jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com1tag:blogger.com,1999:blog-4423175964608972068.post-16707881673416878872008-06-05T14:16:00.005-06:002008-06-05T14:42:13.518-06:00DVCS Comparison: On mainline merges and fast forwards<a href="http://vcscompare.blogspot.com/2008/06/on-mainline-merges-and-fast-forwards.html">DVCS Comparison: On mainline merges and fast forwards</a> has a discussion about whether 'fast forward' is a "better" method for merging in a distributed topology.<br /><br />I can understand where he is coming from, and we respect that some users prefer other workflows. 
Bazaar even has direct support for 'fast forward' with '<span style="font-family:courier new;">bzr merge --pull</span>', and with our aliasing functionality, you can set:<br /><span style="font-family:courier new;"><br />[ALIASES]</span><br /><span style="font-family:courier new;"> merge = merge --pull</span><br /><br />in <span style="font-family:courier new;">~/.bazaar/bazaar.conf</span> to change the default meaning of '<span style="font-family:courier new;">bzr merge</span>'. However, I still fall on the side of the fence that fast forward should not be the default.<br /><br />I can agree that if you have 2 people collaborating on <span style="font-style: italic;">the same feature</span> you would want fast forward, though I would argue that is because they are effectively working on the <span style="font-style: italic;">same branch</span>. For my personal workflow, I have a different alias set:<br /><br /><span style="font-family:courier new;">log = log --short -r -10..-1 --forward</span><br /><br />What this means is that when I type 'bzr log' I see just the mainline commits of a branch, without the merge cruft.
(Where I define the merge cruft as the individual revisions that make up a feature change, not the 'merge foo' node.)<br /><br />Take this view of bzr.dev:<br /><pre>3466 Canonical.com Patch Queue Manager 2008-06-02 [merge]<br /> (jam) Give Aaron the benefit of bug #202928<br /><br />3467 Canonical.com Patch Queue Manager 2008-06-03 [merge]<br /> (Martin Albisetti) Better message when a repository is locked.<br /><br />3468 Canonical.com Patch Queue Manager 2008-06-03 [merge]<br /> (mbp) merge 1.6b1 back to trunk<br /><br />3469 Canonical.com Patch Queue Manager 2008-06-04 [merge]<br /> (mbp) Update more users of default file modes from control_files to bzrdir<br /><br />3470 Canonical.com Patch Queue Manager 2008-06-04 [merge]<br /> (Jelmer) Move update_revisions() implementation from BzrBranch to<br /> Branch.<br /><br />3471 Canonical.com Patch Queue Manager 2008-06-04 [merge]<br /> (vila) Split a test<br /><br />3472 Canonical.com Patch Queue Manager 2008-06-04 [merge]<br /> (jam) Fix bug #235407, if someone merges the same revision twice,<br /> don't record the second one.<br /><br />3473 Canonical.com Patch Queue Manager 2008-06-05 [merge]<br /> Isolate the test HTTPServer from chdir calls (Robert Collins)<br /><br />3474 Canonical.com Patch Queue Manager 2008-06-05 [merge]<br /> Add the 'alias' command (Tim Penhey)<br /><br />3475 Canonical.com Patch Queue Manager 2008-06-05 [merge]<br /> (mbp) #234748 fix problems in final newline on Knit add_lines and<br /> get_lines<br /></pre>You get to see a nice short summary of everything that has been happening (in proper chronological order.) Admittedly, seeing "Patch Queue Manager" on each of those commits is less optimal (which is why we add the author names.) That is just a temporary limitation of our PQM. 
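The rule that 'merge --pull' applies can be modelled in a few lines. This is a toy sketch with plain lists of revision ids, nothing like bzrlib's real data structures:

```python
# Toy model of fast forward vs. a normal merge (illustration only,
# not bzrlib's actual implementation).
def merge(target, source, pull=False):
    if pull and target == source[:len(target)]:
        # Fast forward: the target's history is a prefix of the
        # source's, so the branch pointer simply advances and no
        # merge revision is created.
        return list(source)
    # Normal merge: a new mainline revision records the merge.
    return list(target) + [('merge', source[-1])]

trunk = ['r1', 'r2']
feature = ['r1', 'r2', 'r3']
print(merge(trunk, feature, pull=True))  # → ['r1', 'r2', 'r3']
print(merge(trunk, feature))             # → ['r1', 'r2', ('merge', 'r3')]
```

Only the ('merge', ...) entries appear on the mainline, which is why the --short log above stays compact even when each entry pulls in many revisions.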
Bazaar already supports setting an <span style="font-family: courier new;">--author</span> separate from the committer; we just need to teach our integration bot to use it.<br /><br />The big difference, IMO, is whether you are bringing in someone else's changes to enhance your work, or whether you are collaborating on the same item. I would argue that collaborating on the same item is slightly less common. It also depends on what you <span style="font-weight: bold;">do</span> with the merge commits. Just saying "merge from branch A" is certainly not helpful. But when you can say "merge Jerry's changes to Command.foo", it can indeed be helpful when tracking back through and figuring out where and when "foo" changed, without being lost in the forest for having too many trees.jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com0tag:blogger.com,1999:blog-4423175964608972068.post-73745729316041804352008-06-05T14:02:00.005-06:002008-06-05T14:12:44.293-06:00This Week in BazaarThis is the fifth in a series of posts about current topics in the development world of the Bazaar distributed version control system. The series is co-authored by John Arbash Meinel, one of the primary developers on Bazaar, and Elliot "fresh needle" Murphy. The two topics for this week are not related, but it's our blog and we get to write what we want.<br /><br /><span style="font-weight: bold;font-size:180%;" >Hosting of Bazaar branches</span><br /><br />One of the first questions people ask when moving to Bazaar is "Where can I host my branches?" Even with distributed revision control, it is often handy to have a shared location where you publish your code, and merge code from others. Canonical has put a lot of work into making launchpad.net an excellent place to host code, but there are many other options available.<br /><br />Because Bazaar supports "dumb" transports like sftp, you can publish your branches anywhere that you can get write access to some disk space.
For example, sourceforge.net gives projects some web space with sftp access, and you can easily push branches up over sftp. It's also easy to use bzr on any machine that you have ssh access to; you don't even need to install bazaar on the remote machine.<br /><br />As bazaar is a GNU project, we've been working with the <a href="http://savannah.gnu.org/">Savannah</a> team to enable bazaar hosting on Savannah also.<br /><br />Another option is serving bazaar branches over HTTP. You can do this for both read and write access, and there is a great <a href="http://doc.bazaar-vcs.org/bzr.dev/en/user-guide/index.html#serving-bazaar-with-fastcgi">HOWTO</a> in the bazaar documentation. Do you know of anywhere else that is offering Bazaar hosting? Let us know in the comments!<br /><br /><br /><span style="font-weight: bold;font-size:180%;" >Bazaar review and integration process</span><br /><br />How do you ensure high-quality code when working on a fast-moving codebase in a widely distributed team? Here are some things that we've been doing with the Bazaar project, and we think they are useful practices for most projects.<br /><br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Automated Test Suite</span><br /></span><br />One very important key towards having a stable product is proper testing. As people say, "untested code is broken code". In the Bazaar project, we recommend that developers use <a href="http://en.wikipedia.org/wiki/Test-driven_development">Test Driven Development</a> as much as possible. However, what we *require* is that all new code has tests. The reason it is important for the tests to be automated is that it transfers knowledge about the code base between developers. I can run someone else's test, and know if I conformed to their expectations about what this piece of code should do.<br /><br />This actually frees up development tremendously, especially when you are doing large changes. 
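As a toy illustration of how a test encodes the author's expectations, here is a sketch in Python's unittest style; the helper and its behaviour are invented for the example, not real bzrlib code:

```python
import unittest

def shorten_revid(revid, limit=12):
    """Hypothetical helper: abbreviate a long revision id for display."""
    return revid if len(revid) <= limit else revid[:limit] + "..."

class TestShortenRevid(unittest.TestCase):
    # These tests state the author's intent; anyone refactoring
    # shorten_revid() later can rerun them to check conformance.
    def test_short_ids_untouched(self):
        self.assertEqual(shorten_revid("rev-1"), "rev-1")

    def test_long_ids_abbreviated(self):
        self.assertEqual(
            shorten_revid("john@example.com-20080605-abcdef"),
            "john@example...")
```

Running the suite (for example with `python -m unittest`) is how the next developer verifies they still conform to these expectations.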
With a good test suite, you can be confident that after your 2000-line refactoring, everything still works as before.<br /><br /><br /><span style="font-weight: bold;font-size:130%;" >Code Review<br /><br /></span>Having other people look at your changes is a great way to catch the little things that aren't always directly testable. It also helps transfer knowledge between developers. So one developer can spend a couple of weeks experimenting with different changes, and then there is at least one other person who is aware of what those are.<br /><br />The basic rules for getting code merged into Bazaar are:<br /><ol><li>It doesn't reduce code clarity</li><li>It improves on the previous code</li><li>It doesn't reduce test coverage</li><li>It must be approved by 2 developers who are familiar with the code base.</li></ol>We try to apply those rules to avoid hitting the rule "The code must be perfect before it is merged", and the associated project stagnation. Code review is a very powerful tool, but you have to be cautious of "oh, and while you are there, fix this, and that, and this thing over here." Sometimes that is useful to catch things that are easy (drive-by fixes). It can also lead to long delays before you actually get to see the improvements from someone's work, and long delays are demotivating.<br /><br />Item number 3 is a pragmatic way to approach how much testing is required. In general, the test coverage should improve, just like the code quality. But that doesn't mean you have to test all code 100 different ways.<br /><br /><br /><span style="font-weight: bold;font-size:130%;" >Integration Robot (<a href="https://launchpad.net/pqm">PQM</a>)<br /></span>Now that you have a good automated test suite, and proper code reviews, the next step is to make sure that you have a version of the code base which has all the tests passing. 
Often when developing a new feature, it is quite reasonable to "break" things for a bit, while you work out just how everything should fit together. Requiring the tests to pass at each level of development puts an undue burden on developers, preventing them from publishing their changes (to get feedback, to snapshot their progress, etc.) Very often I commit when things are still somewhat broken, as it gives me a way to 'bzr revert' back to something I wanted to try.<br /><br />However, you don't want the official releases of your project to have all of these broken bits in them. The Bazaar project uses a "Patch Queue Manager", which is simply a program that responds to requests to merge changes. When your patch has passed the review stages, you submit it to the PQM, which grabs your changes, applies them, runs the full test suite, and commits the changes to "mainline" if everything is clean.<br /><br />The reasons to use a robot are:<br /><ol><li>Humans are very tempted to say "ah, this is a trivial fix, I'll just merge it", without realizing there is a subtle typo or far-reaching effect. When you have a large test suite, it can often take a while to run all the tests (the bzr.dev test suite runs in 5-10 minutes, but some projects have test suites that take hours.) Having a program doing the work means a human is relieved of the tedium of checking it.</li><li>There is generally only a single mainline, but there may be 50 developers doing work on different branches. When they all want to merge things, it isn't feasible to require the users to run the test suite with the latest development tip. If the development pace is fast enough versus the time to run the test suite, you can get into an "infinite loop": you merge in the latest tip, wait for the tests to pass, and by the time you go to mark this as the new mainline tip, someone else beat you to it. And you go around again. 
PQM does this for you in a fairly natural way.</li><li>Running in a "clean" environment is a safety net for when you forget about a new dependency that you added.</li><li>There are similar ways to do this, such as <a href="http://cruisecontrol.sourceforge.net/">Cruise Control</a> for Subversion. There is one key difference, though. With Cruise Control, you find out after the fact that your mainline has broken. With PQM, we know that every commit to the mainline of Bazaar passed the test suite at that time. This helps a lot when tracking down bugs. It also helps with "dogfooding"...<br /></li></ol><span style="font-weight: bold;font-size:130%;" ><br />Dogfooding</span><br /><br />If you want people to do regular testing of the development version, it must be easy to run different versions of the project without needing a complex install. Bazaar does this by being runnable directly out of the source tree, without any need to set $PYTHONPATH or mess around with installing different versions. You can also easily change out the plugins that are loaded using the BZR_PLUGIN_PATH env variable. This means that developers can run the latest development version, and easily switch to a particular version when trying to reproduce a bug or help a user.<br /><br />By having the PQM running the test suite, developers can run on the bleeding edge, and know that they won't get random breakage. It is always possible that something will break, but the chance is quite low. 
(In 2000 or so commits since we started using PQM, I believe bzr.dev has never been completely unusable, and has had < 5 revisions which we would not recommend people use.)<br /><br />It also means that you can be fairly confident in creating a release directly from your integration branch (mainline).jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com0tag:blogger.com,1999:blog-4423175964608972068.post-42883513313303633122008-05-29T13:48:00.002-06:002008-05-29T14:07:07.137-06:00This Week in BazaarThis is the fourth in a series of posts about current topics in the development world of the Bazaar distributed version control system. The series is co-authored by John Arbash Meinel, one of the primary developers on Bazaar, and Elliot Murphy, imaginary boy and part-time impostor.<br /><br /><br /><span style="font-weight: bold;font-size:130%;" >Stacked branches<br /></span><br />Some projects are very big with lots of files and lots of history. Many projects want to maintain the policy that development is done on independent branches, which are then merged back when the development is complete. However, the overhead of downloading, branching, and uploading the full history is prohibitive. There are a couple of different ways to solve this problem.<br /><br />Dealing with a large branch can be split into two problems: downloading and uploading.<br /><br />Bazaar has had a storage optimization called shared repositories for quite a while. This serves to dramatically reduce the amount of data downloaded for the second, third, etc branches of a project. A shared repository is a big pool of revisions which multiple branches point to. When you grab a new branch into a shared repository, bzr figures out how much of the history it already has, and only downloads the new revisions. So the first branch of a large project transfers most of the data, and grabbing additional branches is very cheap. 
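The saving from a shared repository can be sketched as simple set arithmetic; the revision ids and function below are invented for illustration, not bzr's actual API:

```python
# A shared repository is one pool of revisions that many branches point to.
pool = {"rev-1", "rev-2", "rev-3"}          # already fetched for branch A

def fetch_branch(branch_revisions, pool):
    """Return only the revisions a new branch actually needs to download."""
    needed = set(branch_revisions) - pool   # skip what the pool already has
    pool |= needed                          # new revisions join the shared pool
    return needed

# Branch B shares most of its history with branch A, so it is nearly free:
to_download = fetch_branch({"rev-1", "rev-2", "rev-3", "rev-4"}, pool)
```

The first branch of a project pays the full cost; every later branch only transfers its set difference against the pool.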
In extreme cases, like working on a multi-gigabyte project from a 56k dial-up connection, you could even do things like distribute the initial data on a DVD to prime the shared repository, and then the user only needs to download incremental changes.<br /><br />This technique can also be used for solving the uploading problem. If the upload location uses a shared repository, then uploading a new branch can just copy the new data. The problem is that once you introduce multiple users, they may not want to give other people access to push data into their repository.<br /><br />Another approach to minimizing the data uploaded is called server-side forking, and you can see a nice implementation of this on github.com. The user places a request with the code host to do the copy for them, and when it finishes, they have their own location already primed with the current branch.<br /><br />The Bazaar project is approaching it in a different way. If some data is already public, then you can just reference the other public location when you start uploading your new branch. The first steps in this direction are being termed "Stacked Branches". Basically, instead of requiring all branches to contain the full history, you are allowed to "stack" a branch on top of another. Because the uploader does not have write access to the lower levels of the stack, this addresses the security risks of shared repositories.<br /><br />Stacking also opens up possibilities for the "download" side of the equation. Many users don't need a very deep copy of history to get their work done. If there is a location that can be trusted to be available when they need it, they can copy just the tip revisions, which allows them to do most of their work (commit, diff, etc) without consulting the remote host. 
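The lookup behaviour of a stacked branch can be modelled as a local store with a fallback; this is a toy model for illustration, not bzr's actual API:

```python
class StackedBranch:
    """Toy model: hold only the tip revisions locally, and fall back
    to the stacked-on (public, trusted) location for older history."""

    def __init__(self, local, stacked_on):
        self.local = local            # shallow local store: tips only
        self.stacked_on = stacked_on  # the referenced public location

    def get_revision(self, revid):
        if revid in self.local:
            return self.local[revid]   # common case: no network access
        return self.stacked_on[revid]  # rare case: consult the fallback

public = {"rev-1": "old history", "rev-2": "more history", "rev-3": "tip"}
branch = StackedBranch(local={"rev-3": "tip"}, stacked_on=public)
```

Day-to-day operations only touch the shallow local store; the stacked-on location is consulted just for the occasional deep-history request.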
And when they need more information (such as during a complex merge), the bzr client is able to fall back to the original source to get any needed information.<br /><br />The goal of all this is to make it very easy to start working with a large project, while still making all the history available in a meaningful way. The bulk of this work has been <a href="http://people.ubuntu.com/%7Erobertc/baz2.0/shallow-branch/">completed</a>, and it is likely that it will land in <a href="https://launchpad.net/bzr/+milestone/1.6">bzr 1.6</a> (to be released in a couple of weeks.)jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com4tag:blogger.com,1999:blog-4423175964608972068.post-68010275241874546902008-05-22T14:23:00.005-06:002008-05-22T14:45:19.697-06:00This Week in BazaarThis is the third in an amazingly regular weekly series of posts about current topics in the development world of the Bazaar distributed version control system. The series is co-authored by John Arbash Meinel, one of the primary developers on Bazaar, and Elliot Murphy, Launchpad developer and relentless agitator. This week we have a special guest, Jelmer Vernooij, Samba developer, and author of the <a href="http://bazaar-vcs.org/BzrForeignBranches/Subversion">Bazaar Subversion</a> plugin.<br /><br />In last week's episode, our fearless explorers braved the new world of <a href="http://jam-bazaar.blogspot.com/2008/05/this-week-in-bazaar.html">plugins</a>. Today we will focus on a specific plugin, and talk about how you can use Bazaar with Subversion. 
Earlier this week there was a very nice <a href="http://google-opensource.blogspot.com/2008/05/develop-with-git-on-google-code-project.html">blog post</a> about using Git with the Subversion servers on Google Code Hosting, and plenty of interesting discussion afterwards.<br /><br /><br /><span style="font-weight: bold;font-size:130%;" >Rationale</span><span style="font-size:130%;"><br /></span><br />If you have <a href="http://bazaar-vcs.org">Bazaar</a> installed, why would you want to work with <a href="http://subversion.tigris.org/">Subversion</a>? Well, it's nice not to have to force the whole world to change at once. Bazaar-Subversion integration allows you to use Bazaar without any changes required from the project administrators to the central Subversion server.<br /><br />There are three general cases where you would want to use bzr-svn:<br /><ol><li>Upstream uses Subversion, and you don't yet have commit access. With bzr-svn, you can still make your improvements with all the benefits of a great VCS.</li><li>The project has chosen to use Subversion, you want something better, but still want to play nice with your fellow developers. You can commit to your local Bazaar branch, and push those changes back into Subversion. You can even do "bzr commit" in your Subversion checkout and have it commit those changes to the Subversion server.</li><li>Migration from Subversion to Bazaar. Often when migrating from one VCS to another, there is a period of time where people are adjusting to the new system. bzr-svn allows you to keep letting people commit to Subversion; it's just another branch with changes to be merged.</li></ol><br /><span style="font-weight: bold;font-size:130%;" >Overview</span><br />Currently the bzr-svn dependencies can be a bit tricky to install on some platforms, but that should be much easier once Subversion 1.5 is released. 
Once you get things <a href="http://bazaar-vcs.org/BzrForeignBranches/Subversion#requirements">installed</a>, it's pretty amazing what you can do. On most Debian-based systems, it is a simple "<span style="font-family: courier new;">apt-get install bzr-svn</span>" away.<br /><br />Once you have bzr-svn installed, you can start using Subversion branches as though they were regular Bazaar branches.<br /><br /><br /><span style="font-weight: bold;font-size:130%;" >General usage</span><br /><br />Now that you have bzr-svn installed, how do you get a local copy of your Subversion project? Generally, it is just a "<span style="font-family: courier new;">bzr checkout URL</span>" away.<br /><br /><span style="font-family: courier new;"> $ bzr checkout svn+https://your-project.googlecode.com/svn/trunk</span><br /><br />This will create a local checkout of your project that contains a local copy of the history present remotely.<br /><br />You should now be able to use this branch like any regular Bazaar branch. Since this is a bound branch, any commits you make will also show up in the Subversion repository.<br /><br />It is possible to create new local branches from this branch, for example for feature branches:<br /><br /><span style="font-family: courier new;"> $ bzr branch trunk feature-1</span><br /><br />And to merge the branch back into Subversion once it is finished, you can use merge like you would with any ordinary Bazaar branch:<br /><br /><span style="font-family: courier new;"> $ bzr merge ../feature-1</span><br /><span style="font-family: courier new;"> $ bzr ci -m "Merge feature-1"</span><br /><br />In addition to the code changes, bzr-svn will write metadata about the history of the new commit into Subversion. This means that your merge history is available, so when someone else comes along and grabs a copy of the branch using Bazaar, they can see what happened. 
To a normal Subversion client this is transparent; the custom properties are simply ignored.<br /><br />It is also possible to push directly from the feature branch into Subversion:<br /><br /><span style="font-family: courier new;"> $ bzr push http://subversion/project</span><br /><br />This will preserve all of the history from the branch you are pushing - there is no need to rebase your local branch after pushing.<br /><br />Since bzr-svn allows access to Subversion protocols and file formats using the standard Bazaar API, it is possible to use most standard Bazaar commands directly on Subversion formats and URLs. Commands like "<span style="font-family: courier new;">bzr missing</span>", "<span style="font-family: courier new;">bzr log</span>", or even "<span style="font-family: courier new;">bzr viz</span>" work out of the box.<br /><br /><span style="font-weight: bold;font-size:130%;" >Miscellaneous<br /></span><br />Some bits and pieces to pique your interest in bzr-svn.<br /><ul><li>Subversion 1.5 introduces custom revision properties - this should allow bzr-svn to hide the properties used to store merge information. (At the moment, the file properties used show up in commit emails.)<br /></li><li>Bazaar will soon be introducing <a href="http://bazaar-vcs.org/HistoryHorizon">shallow (stacked) branches</a>. This will allow you to have a fully functioning local branch (including offline commits, etc), without needing to download the complete history to your local machine.</li><li><a href="http://live.gnome.org/BzrForGnomeDevelopers">Bzr for GNOME developers</a> is a quick guide for people who want to use Bazaar for developing with the Subversion Gnome repository.</li><li><a href="http://www.python.org/dev/bazaar/">Bazaar branches of Python</a> are available. 
They are currently using bzr-svn to mirror the Subversion branches, allowing their developers to see what life is like developing with Bazaar.<br /></li></ul>For more information, check out the bzr-svn <a href="http://bazaar-vcs.org/BzrForeignBranches/Subversion">home page</a>, <a href="http://samba.org/%7Ejelmer/bzr-svn/FAQ.html">FAQ</a>, <a href="https://bugs.launchpad.net/bzr-svn">bug tracker</a>, or join us on the Bazaar <a href="https://lists.ubuntu.com/mailman/listinfo/bazaar">mailing list</a>.<br /><br /><br />Next week: how to print money with Bazaar.jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com0tag:blogger.com,1999:blog-4423175964608972068.post-85340491269724358672008-05-15T13:54:00.006-06:002008-05-16T12:54:47.376-06:00This Week in BazaarThis is the second in a mostly-every-week series of posts about what's been happening in the development world of the Bazaar distributed version control system. The series is co-authored by John Arbash Meinel, one of the primary developers on Bazaar, and Elliot Murphy, Launchpad developer and compulsive conflict avoider.<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">Plugins</span><br /></span><br />One of the nice things about Bazaar is the API, which enables new features to be added with plugins. Once a feature is polished and proves widely useful, it can move from a plugin into core bazaar. Most of the plugins are hosted/mirrored on Launchpad, and are a simple "<span style="font-family: courier new;">bzr branch lp:bzr-plugin ~/.bazaar/plugins/plugin</span>" away. For the rest, they are indexed at <a href="http://bazaar-vcs.org/BzrPlugins">http://bazaar-vcs.org/BzrPlugins</a>. Here's a quick summary of some of the plugins we are using on our laptops right now:<br /><br /><span style="font-weight: bold;">bookmarks</span>: This allows me to store an alias for a branch location, so it is easier to branch/push to a common location. 
So I can type '<span style="font-family: courier new;">bzr get bm:work/foo</span>' instead of '<span style="font-family: courier new;">bzr get bzr+ssh://server.example.com/dev/stuff/foo</span>'<br /><br /><span style="font-weight: bold;">bzrtools</span>: A collection of commands which provide extended functionality, such as '<span style="font-family: courier new;">bzr cdiff</span>' to display colored diffs and '<span style="font-family: courier new;">bzr shelve</span>' to temporarily revert sections of changes.<br /><br /><span style="font-weight: bold;">difftools </span>and <span style="font-weight: bold;">extmerge</span>: These plugins let me view differences in meld or kdiff3 (or anything that you want to configure, really), and do merges via meld.<br /><br /><span style="font-weight: bold;">email</span>: Keep people informed of what you are working on by sending an email after every commit.<br /><br /><span style="font-weight: bold;">fastimport</span>: This plugin allows me to import code from my friend's Mercurial repository and push it to launchpad.<br /><br /><span style="font-weight: bold;">git</span>: This gives me read access to a local git repository.<br /><br /><span style="font-weight: bold;">gtk</span>: This is the Bazaar Gtk GUI, which has some nice tools like visualize and gcommit.<br /><br /><span style="font-weight: bold;">htmllog</span>: Useful for generating html formatted logs for publishing on the web.<br /><br /><span style="font-weight: bold;">loom</span>: Allows me to manage several "layers" of development in a single branch, and collaborate on those layers with other people.<br /><br /><span style="font-weight: bold;">notification</span>: Gives a GUI popup when a pull or push completes.<br /><br /><span style="font-weight: bold;">pqm</span>: This formats a merge request to PQM. PQM then takes my branch, merges to main, runs tests, and commits the merge if all was well. 
This ensures that we always have passing tests in the main tree!<br /><br /><span style="font-weight: bold;">push_and_update</span>: This updates the working tree when I push my branch to a remote server. Very useful for doing website updates.<br /><br /><span style="font-weight: bold;">removable</span>: I try to keep all branches very small for easier review, so I have a lot of branches at one time. This tells me which branches have already been merged to the main tree (and thus can be removed). It can also let me know why something is not ready to be removed.<br /><br /><span style="font-weight: bold;">stats</span>: Provides '<span style="font-family: courier new;">bzr stats</span>' which gives a simple view of how many people have committed to your project and how many commits each has done.<br /><br /><span style="font-weight: bold;">update_mirrors</span>: '<span style="font-family: courier new;">bzr update-mirrors</span>' recursively scans for Bazaar branches and updates them to their latest upstream.<br /><br /><span style="font-weight: bold;">vimdiff</span>: Adds the commands '<span style="font-family: courier new;">bzr vimdiff</span>' and '<span style="font-family: courier new;">bzr gvimdiff</span>'. Which opens vim in side-by-side mode, showing you your changes.<br /><br /><span style="font-weight: bold;">qbzr</span>: Another great GUI for bzr, this one is written using Qt.<br /><br /><br /><span style="font-weight: bold;font-size:130%;" >1.5rc1, 1.5 this Friday</span><br /><br />Continuing our pattern of having time-based releases, bzr 1.5rc1 was released last Friday, and 1.5 final should be released tomorrow. Ever wonder how we churn out releases so regularly? The biggest factor enabling us to make consistent releases is our use of a <a href="http://bazaar-vcs.org/PatchQueueManager">Patch Queue Manager</a>. It ensures that all of our 11,724 unit tests pass before allowing any merge into mainline. 
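The gatekeeping logic itself is simple; here is a toy Python sketch of it, with invented revision names and a stand-in test suite (the real PQM shells out to bzr and runs the actual tests):

```python
def pqm_merge(mainline, submission, run_tests):
    """Apply a submitted merge to a candidate tree, run the full test
    suite, and advance mainline only if the suite passes."""
    candidate = mainline + [r for r in submission if r not in mainline]
    if run_tests(candidate):
        return candidate    # merge accepted: new mainline tip
    return mainline         # tests failed: mainline is left untouched

# Stand-in test suite: fail if any revision is marked broken.
suite = lambda revs: not any("broken" in r for r in revs)

main = ["r1", "r2"]
main = pqm_merge(main, ["r3-feature"], suite)            # lands cleanly
main = pqm_merge(main, ["r4-broken-refactor"], suite)    # bounced
```

The key property is that a rejected submission leaves the mainline exactly as it was, so every published tip has passed the suite.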
Even when lots of changes are landing, the trunk can be considered release quality. Most of the developers use the tip of mainline for their day-to-day work, which means that any changes get immediate use, rather than waiting for a release candidate. <br /><br />By releasing every month, we have reduced the tendency to rush patches, trying to sneak them in before the next release. We know that there will be another release just around the corner, so we can land complex patches right after a release. For each release cycle, we have 3 weeks of "open" development, where any approved (peer-reviewed) patch can be merged. Then we have a feature freeze week, where only bug fixes are supposed to be merged. At the end of the freeze week, we release RC1 and reopen mainline for development. If no regressions are found in RC1, it is tagged as final and released after one week.<br /><br />The bzr-1.5 release is mostly focused on fixing small UI bugs, a couple of performance improvements, and some documentation updates.<br /><br />(edit: 2008-05-16, the merged plugin changed and is now called bzr-removable)jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com3tag:blogger.com,1999:blog-4423175964608972068.post-91612029264500798022008-05-14T09:37:00.003-06:002008-05-14T12:46:14.187-06:00Creating a new Launchpad Project (redux)A while back I <a href="http://jam-bazaar.blogspot.com/2007/03/11-steps-to-creating-new-launchpad.html">posted</a> about how to set up a new launchpad project. At the time it took quite a few steps to set everything up that you wanted. I'm happy to report that a lot of those steps have been streamlined, so I'm posting new step-by-step instructions for setting up your project in Launchpad.<br /><br /><ol><li>Make sure the project isn't already registered. A lot of upstream projects have already been registered in Launchpad, as it is used to track issues in Ubuntu. 
So it is always good to start on the main page and use the search box "Is your project registered yet?".</li><li>If you don't find your project, there will be a link to <a href="https://launchpad.net/projects/+new">Register a new project</a></li><li>The form for filling out your project details has been updated a bit, but you should know the answers. (I still use 'bazaar' as the "part of" super-project, and bzr-plugin-name for my plugins)</li><li>This is where things start to get easier. After you have registered the project you can follow the Change Details link. This is generally https://launchpad.net/PROJECT/+edit. It was the same before, but now more information is on a single page, so you can set up more at once. Here I always set the bug tracker to be Launchpad, and I click the boxes to opt in for extra launchpad features.</li><li>Optionally you can assign the project to a shared group. Follow the "Change Maintainer" link (https://launchpad.net/PROJECT/+reassign). I generally assign them to the bzr group, because I don't want to be the only one who can update information.</li><li>At this point you should be able to push up a branch to be used as the mainline using:<br /> <span style="font-family: courier new;">bzr push lp:///~GROUP/PROJECT/BRANCH</span><br />in my example, this is <span style="font-family: courier new;">lp:///~bzr/PROJECT/trunk</span>. (You may need to run '<span style="font-family: courier new;">bzr launchpad-login</span>' so that bzr knows who to connect as, rather than using anonymous http:// urls)</li><li>You now want to associate your mainline branch with the project, so that people can use the nice <span style="font-family: courier new;">lp:///PROJECT</span> urls. 
You can follow the link on your project page for the "trunk" release series (usually this is https://launchpad.net/PROJECT/trunk). On that page is an "Edit Source" link, or https://launchpad.net/PROJECT/trunk/+source.<br />Set the official release series branch to your new <span style="font-family: courier new;">~GROUP/PROJECT/BRANCH</span>.</li></ol>See, now it is only 7 steps instead of 11. (Though only really one or two steps have actually changed.)jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com1tag:blogger.com,1999:blog-4423175964608972068.post-87343461677889353292008-05-08T14:00:00.007-06:002008-05-15T13:12:03.427-06:00This Week In Bazaar First EditionThis is the first in a mostly-every-week series of posts about what's been happening in the development world of the Bazaar distributed version control system. The series is co-authored by <a href="https://launchpad.net/%7Ejameinel">John Arbash Meinel</a>, one of the primary developers on Bazaar, and <a href="https://launchpad.net/%7Estatik">Elliot Murphy</a>, Launchpad developer and wanted criminal.<br /><br />We get to talk about anything we want. This week:<br /><ul><li>What's been happening for a better GUI on Windows</li><li>What's new in the 1.4 release</li><li>Importing from other VCS's with bzr fast-import</li></ul>... details ...<br /><br /><span style="font-size:130%;"><span style="font-weight: bold;">GUI on Windows</span><br /></span><br />We found this guy named <a href="http://python.net/crew/mhammond/">Mark Hammond</a> who claims to know how to make python stuff work well on Windows. There is an existing GUI tool for Bazaar on Windows called <a href="http://bazaar-vcs.org/TortoiseBzr">TortoiseBZR</a> now, modeled after TortoiseSVN. 
If you haven't used a Tortoise before, they are extensions that integrate into Windows Explorer, allowing you to see and control the versioning of your files without needing to change to a separate tool.<br /><br />Mark has taken a look and proposed a <a href="http://doc.bazaar-vcs.org/bzr.dev/developers/tortoise-strategy.html">series of enhancements</a> to make the tool work even better. Bazaar already works very well from the Windows command prompt, but we want to provide excellent GUI tools as well. Take a look at the <a href="http://bazaar-vcs.org/TortoiseBzr">TortoiseBZR</a> web page for screenshots of it in action.<br /><br /><span style="font-weight: bold;font-size:130%;" >What's new in the 1.4 release</span><br /><br />The Bazaar team releases a new version of Bazaar just about every month, with both bugfixes and new features. The bzr-1.4 release came out last Thursday, May 1st.<br /><br />The major changes for 1.4 include improvements in performance of 'log' and 'status', and a new Branch hook called <a href="http://doc.bazaar-vcs.org/bzr.dev/en/user-reference/bzr_man.html#post-change-branch-tip-branch">post-change-branch-tip</a>, which will trigger any time a Branch is modified (push, commit, etc). This should enable server-generated emails whenever somebody publishes their changes. Write something cool with it and tell us what you did!<br /><br />The full list of changes for 1.4 can be found at: <a href="https://launchpad.net/bzr/1.4/1.4">https://launchpad.net/bzr/1.4/1.4</a><br />The list of all changes is at <a href="http://doc.bazaar-vcs.org/bzr.dev/en/release-notes/NEWS.html">http://doc.bazaar-vcs.org/bzr.dev/en/release-notes/NEWS.html</a><br /><br /><span style="font-weight: bold;font-size:130%;" >bzr fast-import</span><br /><br />Bazaar fast-import is a <a href="https://launchpad.net/bzr-fastimport">plugin for bazaar</a> that allows you to import from many different version control systems. 
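Under the hood this all rides on a plain text stream. As a rough sketch, here is a small Python helper that renders one "commit" command of that stream; the ref, names, date and file paths are invented for illustration, and the git fast-import documentation has the full grammar:

```python
def commit_command(ref, mark, committer, when, message, files):
    """Render one fast-import 'commit' command; files is a list of
    (path, content) pairs emitted as inline file modifications."""
    lines = [
        "commit %s" % ref,
        "mark :%d" % mark,
        "committer %s %d +0000" % (committer, when),
        "data %d" % len(message.encode("utf-8")),  # byte length of message
        message,
    ]
    for path, content in files:
        lines.append("M 644 inline %s" % path)
        lines.append("data %d" % len(content.encode("utf-8")))
        lines.append(content)
    return "\n".join(lines) + "\n"

stream = commit_command(
    "refs/heads/trunk", 1, "Jane Dev <jane@example.com>", 1210000000,
    "initial import", [("hello.txt", "hello world\n")])
```

Any exporter that can emit a sequence of such commands, one per commit, can feed the importer.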
The fast-import stuff is intended to support any system that can use the fast-export format. This format was originated by git developers, and quickly adopted elsewhere. So if a source format can generate a <a href="http://www.kernel.org/pub/software/scm/git/docs/git-fast-import.html">"fast-import"</a> stream, you should be able to import it into Bazaar.<br /><br /><ul><li>CVS<br />To convert from cvs, you currently use the <a href="http://cvs2svn.tigris.org/cvs2svn.html">cvs2svn</a> converter, which has a flag to generate a "fast-import" stream.</li><li>Mercurial<br />There is a script called hg-fast-export.py bundled with the plugin (in the exporters/ directory).</li><li>SVN<br />The svn-fast-export script is also bundled with the bzr-fastimport plugin.</li><li>git<br />Bundled with the standard git distribution is the <a href="http://www.kernel.org/pub/software/scm/git-core/docs/git-fast-export.html">git-fast-export</a> command.</li><li>Your own exotic system here.</li></ul>Give fast-import a try. It's mostly designed for one-time conversions, rather than mirroring, but there are already some rudimentary mirroring capabilities.<br /><br /><br />That's all for the first installment of "This Week in Bazaar".<br /><br /><span style="font-size:85%;">(edited for formatting)</span>jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com1tag:blogger.com,1999:blog-4423175964608972068.post-78098902170498850842007-10-17T11:40:00.001-06:002007-11-15T09:37:41.073-06:00Bazaar vs SubversionEvery so often someone comes along wanting to know which VCS they should use. 
I won't claim to be an impartial observer, but this is a list of things I put together for the last discussion, and I thought I would share it here.<br /><ol><li>SVN <span style="font-weight: bold;">requires</span> all commits to go to a central location, and tends to favor having multiple people working on the same branch.<br /><br />This is both a positive and a negative, depending on what you are trying to do.<br /><br />When you have a bunch of developers who don't know a lot about VCS, it simplifies things for them. They don't have to worry about branches; they just do their work and check it in.<br />The disadvantage is that they can tread on each other's toes (by committing a change that breaks someone else's work), and their work immediately gets mixed together and can't be integrated separately.<br /><br />Bazaar has chosen to address this with workflows. You can explicitly have a branch set up to send all commits to a central location (<tt>bzr checkout</tt>), just as you do with SVN. Also, if two people check out the same branch, they must stay in sync. (Bazaar actually has a stronger restriction here than SVN does, because SVN only complains if they modify the same files, whereas Bazaar requires that the whole tree be up to date.)<br /><br />However, with a Bazaar checkout, there is always the possibility to either <tt>bzr unbind</tt> or just <tt>bzr commit --local</tt> when you are on a plane, or simply want to record in-progress work before integrating it into the master branch.<br /><br /></li><li>SVN has a lot more 3rd party support.<br /><br />SVN has just been around longer, and is pretty much the dominant open source centralized VCS. There are a lot of DVCSes at the moment, all doing things a little bit differently. Competition is good, but it makes it more difficult to pick one over the others, and 3rd party tools aren't going to be built for every one of them.<br /><br />However, Bazaar already has several good third party tools. 
For viewing changes to a single file, <tt>bzr gannotate</tt> can show when each line was modified, and what the associated commit message was. It even allows drilling back in history to prior versions of the file.<br />For viewing the branch history (showing all the merged branches, etc.) there is <tt>bzr viz</tt>.<br />There are both gtk and qt GUIs, and a Patch Queue Manager (PQM) for managing an integration branch (where the test suite must always pass or the patch is rejected).<br />There is even basic Windows Shell integration (TortoiseBzr), a Visual Studio plugin, and an Eclipse plugin.<br /><br /></li><li>Bazaar is generally much easier to set up.<br /><br />SVN can only really be set up by an administrator, someone who has a bit more of an idea what they are doing. Setting up WebDAV over http is easier than it used to be, but it isn't something you would ask just anyone to do. Getting a project using Bazaar is usually as simple as <tt>bzr init; bzr add; bzr commit -m "initial import"</tt>.<br /><br />You can push and pull over simple transports (ftp, sftp, http).<br /><br />Because SVN is centralized, you only really set it up one time anyway, so as long as you have one competent person on your team, you can probably get started.<br /><br /></li><li>It is easier to get 3rd party contributions.<br /><br />If you give a user commit access to your SVN repository, then you have their changes available whenever they commit. But usually this also means that they have access to change things that you don't really want them to touch. (Yes, there are ACLs that you can set up, but I don't know many projects that go to that trouble for casual contributors.)<br /><br />If you haven't given them commit access, then they have to work on their own, and the VCS doesn't give you a direct way to collaborate with them. 
You are back to using something like diff+patch.<br /><br />Because Bazaar supports intelligent merging between "repositories", integrating other people's work is usually a <tt>bzr merge</tt> away. SVN 1.5 is supposed to address the merge issue, but at best it helps within a repository. So if someone is developing on their own, you are still stuck with diff + patch.<br /><br />Just to reiterate, Bazaar can make it much easier to get "drive-by" contributions from users, which can be a good stepping stone towards increasing your development community.<br /><br /></li><li>Subversion's model is a giant versioned filesystem. Bazaar uses a concept of a Tree.<br /><br />I have little doubt that this made merge tracking more difficult in SVN, since there isn't a clear 'top' that has been merged with the other 'top'.<br /><br />It also means that SVN commits aren't atomic in the same way that Bazaar commits are. In Bazaar, when you commit, you are guaranteed to be able to get back to that same revision. With SVN, if people are working on different files, both can commit, and when you check out the final tree, it will not match either side.<br />This has implications for ensuring that the test suite passes on a given branch: it can pass on my machine and on theirs, but fail on a checkout made after we both commit.<br /><br /></li><li>SVN supports partial checkouts better than Bazaar does.<br /><br />This is mostly a consequence of the above point, rather than an explicit thing. But because SVN doesn't label anything as a special Tree, you can check out <tt>project/doc</tt> just as easily as <tt>project</tt>.<br /><br />We are looking into ways to at least fake this with Bazaar (we secretly check out the whole tree, but hide the bits that you don't care about), because we are aware of use cases where it is important. 
(A documentation team that doesn't want or need to see all the code, etc.)<br /><br /></li><li>SVN stores history on the server.<br /><br />In the standard workflows, Bazaar has you copy the full project history to your local machine. For most projects, this isn't a big deal, because the delta-compressed history is only a small multiple of a checked out tree. (Plus SVN always checks out two copies anyway.)<br />But there are times when people abuse the VCS, and check in a CD ISO (which gets deleted shortly thereafter). Suddenly you have more garbage data in your repository than you have desirable data.<br /><br />Bazaar does have support for "lightweight checkouts", which are SVN-style working directories where all the history is on the server and only the working tree is local. Of course if you do this, you lose some flexibility (offline commits), but you get to choose when that fits your needs.<br /><br />We also have "shared repositories" which can be used to share storage between branches. So even though you have 10 branches, you only have one copy of the history.<br /><br />We are working on having a <a href="http://bazaar-vcs.org/HistoryHorizon">Shallow Branch/History Horizon</a>, which should be a very good compromise between the two. The basic idea is that it can pull down the data that you are using, without needing the full history.<br /><br /></li><li>Storage of Binary Files<br /><br />At the moment SVN's delta algorithm for binary files produces smaller deltas than ours does. This is likely to change in coming releases, but for now there will be times when SVN requires less disk space for binary files that you modify often. For binary files that change infrequently, or for compressed ones, there is likely to be less of a difference. 
(Most compressed formats don't delta well, because a small change causes ripples throughout the compressed stream.)<br /><br /></li><li>Handling <span style="font-weight: bold;">large</span> files<br /><br />At the moment, Bazaar expects to be able to fit a small number of copies of the contents of any file in memory. (The merge algorithm needs a BASE, THIS, and OTHER copy.)<br />So when you need to version 1GB movies, etc., SVN is probably a better choice at the moment. You might also consider whether a VCS is actually the right way to handle those files.<br /><br />We are certainly considering changing some parts of our code to read only parts of files, but it is lower on our list of priorities.<br /><br /></li><li>Building up a project out of subprojects<br /><br />At the moment SVN's <tt>externals</tt> handle more use cases than we do.<br />We are working on more complete support with <a href="http://bazaar-vcs.org/NestedTreeSupport">Nested Trees</a>. The internal data structures are present, but not all of the push/pull/merge/etc. commands have been updated.<br /><br />We already have good support for merging a project into another project, so you get one large tree. You can then continue to merge new changes from upstream, and the changes will apply to the correct files. However, once you have aggregated a project, it is harder to send any of your own changes upstream, independent of all the other files. (It is possible to do so, but it requires you to cherry-pick the changes and track which files you modified when.)<br /><br />Also, Nested Trees are designed to allow you to easily check out an exact copy of the full project at the exact revision of every sub-project, while still allowing you to <tt>bzr update</tt> them to the current version of all the sub-projects.<br /><br /></li><li>Clarity of "log"<br /><br />One major difficulty with CVS is just figuring out what has been changing. 
With Bazaar, you can do a simple <tt>bzr log</tt> and it shows you what has been changing for the whole branch. SVN has a similar <tt>svn log</tt>, which shows you what has been changing underneath the current directory. (So they are approximately the same, if you are in the root of an SVN branch.)<br /><br />However, if you use feature branches to develop and then have an integration branch (trunk), with Bazaar you can do <tt>bzr log --short</tt>, which shows only the mainline revisions. In this case, that would be just the integration summary messages. So you can see a single "merged feature X" message, rather than the 50 small commit messages that build up into that feature.<br /><br /></li><li>Plugin Architecture<br /><br />One of Bazaar's main strengths is the ability for third party developers to add commands or customization through the use of plugins. Plugins can provide simple extensions (a different log format to conform to a company's particular style expectations), new commands (history introspection, extra patch management, integration with the PQM), or even support for a different repository format (at the moment <tt><a href="https://launchpad.net/bzr-svn">bzr-svn</a></tt> provides a way to treat an SVN repository as just another Bazaar branch, allowing you to push, pull and merge).<br /><br />While not every user is going to want to write a plugin, it does provide ways for administrators to customize the behavior of Bazaar, so that the tool can be slimmed down to provide just the basics, or expanded to provide specific workflows customized to the situation.<br /><br /></li><li>Rename support<br /><br />This is another place where SVN is much better than CVS, but Bazaar is better still.<br />SVN has support for the basic concept of renaming, though it is implemented as a copy+delete pair. "copy" allows two files to have the same history prior to the point of copying. 
This means commands like <tt>svn log</tt> and <tt>svn annotate</tt> use the full history of the file, but there is more that can be done.<br /><br />One of the reasons projects hesitate to rename files is that it then becomes difficult to accept changes from elsewhere. Suddenly a change has nowhere to go, because the target file is not there anymore. This is where Bazaar has a distinct advantage over SVN. When you rename a file, Bazaar knows that any patches to that file belong in the new destination. This means that when you need to refactor your code to clean up the overall structure, you can still merge changes that were created before the restructuring. I didn't realize how differently I would work with my code until I had the ability to fix simple naming errors. (This file is <tt>'Bars.c'</tt> when it should just be <tt>'bar.c'</tt>, etc.)<br /><br /></li><br /><br />I also wanted to point to a pretty good blog post about Subversion and the rest of the world <a href="http://blog.red-bean.com/sussman/?p=79">here</a>. A lot of that is why Bazaar has a centralized workflow you can use, and why we are trying to make sure things like <a href="https://launchpad.net/bzr-gtk">bzr-gtk</a> (which is the parent project for <a href="http://bazaar-vcs.org/Olive">Olive</a> and <a href="http://bazaar-vcs.org/TortoiseBzr">TortoiseBzr</a>) are fully functional.<br /></ol><br />In summary, SVN may be a better choice if you have large binary files or projects built from subprojects, need partial checkout support, or want more mature integration with 3rd party tools than Bazaar currently has. 
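As a concrete taste of the checkout workflow from point 1, the whole bound/unbound cycle can be sketched in a few commands. This is only a sketch: it plays out against a local "central" branch in a temporary directory so no server is needed, and it skips itself entirely when bzr is not installed.

```shell
#!/bin/sh
# Sketch of Bazaar's bound-branch workflow, entirely in a temp directory.
result=skipped
if command -v bzr >/dev/null 2>&1; then
    work=$(mktemp -d)
    bzr init -q "$work/central"              # stand-in for the shared branch
    bzr checkout "$work/central" "$work/co"  # a checkout: commits go to central
    cd "$work/co"
    echo hello > file.txt
    bzr add -q file.txt
    bzr commit -q -m "goes straight to the central branch"

    # On a plane? Record work locally without touching the central branch...
    bzr commit --local --unchanged -m "recorded locally while offline"

    # ...or drop the binding entirely and work as an independent branch,
    # then rejoin the central workflow later.
    bzr unbind
    bzr bind "$work/central"
    bzr update
    result=ok
fi
```

The point is that the choice between "SVN-style central commits" and "fully local commits" is per-branch and reversible, rather than baked into the tool.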
On the other hand, if workflow flexibility is important, collaborating with others and increasing community participation matter, low administration overhead is appealing, or you care about quality branching/merging and correct rename handling, then Bazaar can help make life more enjoyable and ought to be seriously considered, either now or in the future, depending on how comfortable you are with its maturity.jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com9tag:blogger.com,1999:blog-4423175964608972068.post-46715609114330317742007-05-21T08:46:00.000-06:002007-05-21T08:54:09.191-06:00Ogg Vorbis and iTunesI've been a longtime supporter of Ogg Vorbis, and I'm also a Mac user. While I haven't figured out how to get my iPod to play Ogg just yet, I have worked on getting iTunes to play it. I periodically do searches to see if things have improved, but they seem to return mostly old results.<br /><br />So I just wanted to get the word out that the good people at Xiph have started maintaining the Ogg Vorbis plugin. It is available <a href="http://www.xiph.org/quicktime/download.html">here</a>.<br /><br />I don't seem to be able to find the page again, but I thought I read there were some small problems with the last release. They have development snapshots <a href="http://people.xiph.org/%7Earek/">here</a>. At least so far, I haven't run into any problems with them, and overall they seem to consume fewer CPU resources than the older releases.jamhttp://www.blogger.com/profile/17344213294371886790noreply@blogger.com1