
Thursday, June 5, 2008

This Week in Bazaar

This is the fifth in a series of posts about current topics in the development world of the Bazaar distributed version control system. The series is co-authored by John Arbash Meinel, one of the primary developers on Bazaar, and Elliot "fresh needle" Murphy. The two topics for this week are not related, but it's our blog and we get to write what we want.

Hosting of Bazaar branches

One of the first questions people ask when moving to Bazaar is "Where can I host my branches?" Even with distributed revision control, it is often handy to have a shared location where you publish your code, and merge code from others. Canonical has put a lot of work into making launchpad.net an excellent place to host code, but there are many other options available.

Because Bazaar supports "dumb" transports like sftp, you can publish your branches anywhere you can get write access to some disk space. For example, sourceforge.net gives projects some web space with sftp access, and you can easily push branches up over sftp. It's also easy to use bzr with any machine that you have ssh access to; you don't even need to install Bazaar on the remote machine.

As Bazaar is a GNU project, we've also been working with the Savannah team to enable Bazaar hosting on Savannah.

Another option is serving Bazaar branches over HTTP. You can do this for both read and write access, and there is a great HOWTO in the Bazaar documentation. Do you know of anywhere else that is offering Bazaar hosting? Let us know in the comments!


Bazaar review and integration process

How do you ensure high quality code, when working on a fast moving codebase in a widely distributed team? Here are some things that we've been doing with the Bazaar project, and we think they are useful practices for most projects.


Automated Test Suite

One very important key to having a stable product is proper testing. As people say, "untested code is broken code". In the Bazaar project, we recommend that developers use Test Driven Development as much as possible. However, what we *require* is that all new code has tests. The reason it is important for the tests to be automated is that it transfers knowledge about the code base between developers: I can run someone else's tests and know whether I conformed to their expectations about what this piece of code should do.

This actually frees up development tremendously, especially when you are making large changes. With a good test suite, you can be confident that after your 2000-line refactoring, everything still works as before.
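
As a small illustration (the split_lines function and its behaviour here are made up for this post, not taken from the Bazaar code base), this is the kind of automated test we mean: the test itself records the author's expectations in a form anyone else can re-run.

    import unittest

    def split_lines(text):
        # Hypothetical helper: split text into lines, keeping the line endings.
        return text.splitlines(True)

    class TestSplitLines(unittest.TestCase):
        # Another developer can run these and immediately know whether their
        # change still conforms to the original author's expectations.
        def test_preserves_line_endings(self):
            self.assertEqual(['a\n', 'b\n'], split_lines('a\nb\n'))

        def test_last_line_without_newline(self):
            self.assertEqual(['a\n', 'b'], split_lines('a\nb'))

    if __name__ == '__main__':
        unittest.main()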


Code Review

Having other people look at your changes is a great way to catch the little things that aren't always directly testable. It also helps transfer knowledge between developers. So one developer can spend a couple weeks experimenting with different changes, and then there is at least one other person who is aware of what those are.

The basic rules for getting code merged into Bazaar are that:
  1. It doesn't reduce code clarity.
  2. It improves on the previous code.
  3. It doesn't reduce test coverage.
  4. It must be approved by 2 developers who are familiar with the code base.
We try to apply those rules without falling into the trap of "the code must be perfect before it is merged", and the project stagnation that comes with it. Code review is a very powerful tool, but you have to be cautious of "oh, and while you are there, fix this, and that, and this thing over here." Sometimes that is useful for catching things that are easy (drive-by fixes). It can also lead to long delays before you actually get to see the improvements from someone's work, and long delays are demotivating.

Item number 3 is a pragmatic way to approach how much testing is required. In general, the test coverage should improve, just like the code quality. But that doesn't mean you have to test all code 100 different ways.


Integration Robot (PQM)

Now that you have a good automated test suite and proper code reviews, the next step is to make sure that you have a version of the code base in which all the tests pass. Often when developing a new feature, it is quite reasonable to "break" things for a bit while you work out just how everything should fit together. Requiring the tests to pass at every step of development puts an undue burden on developers, preventing them from publishing their changes (to get feedback, to snapshot their progress, etc.). Very often I commit when things are still somewhat broken, as it gives me a point to 'bzr revert' back to when I want to try something else.

However, you don't want the official releases of your project to have all of these broken bits in them. The Bazaar project uses a "Patch Queue Manager" (PQM), which is simply a program that responds to requests to merge changes. When your patch has passed the review stage, you submit it to the PQM, which grabs your changes, applies them, runs the full test suite, and commits the changes to "mainline" if everything is clean (there is a rough sketch of this loop after the list below).

The reasons to use a robot are:
  1. Humans are very tempted to say "ah, this is a trivial fix, I'll just merge it", without realizing there is a subtle typo or a far-reaching effect. When you have a large test suite, it can often take a while to run all the tests (the bzr.dev test suite runs in 5-10 minutes, but some projects have test suites that take hours). Having a program do the work relieves a human of the tedium of checking it.
  2. There is generally only a single mainline, but there may be 50 developers doing work on different branches. When they all want to merge things, it isn't feasible to require each developer to run the test suite against the latest development tip. If the pace of development is fast relative to the time it takes to run the test suite, you can get into an "infinite loop": you merge in the latest tip, wait for the tests to pass, and by the time you go to mark this as the new mainline tip, someone else has beaten you to it, and you go around again. PQM handles this for you in a fairly natural way.
  3. Running in a "clean" environment is a safety net for when you forget about a new dependency that you added.
  4. There are similar ways to do this, such as CruiseControl for Subversion. There is one key difference, though. With CruiseControl, you find out after the fact that your mainline has broken. With PQM, we know that every commit to the mainline of Bazaar passed the test suite at that time. This helps a lot when tracking down bugs. It also helps with "dogfooding"...
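
To make the idea concrete, here is a rough, much-simplified sketch of what an integration robot does. The function name, branch URLs, and exact sequence of bzr commands are illustrative only; the real PQM also queues submissions, takes locks, and mails back failure reports.

    import subprocess

    def try_to_land(submission_branch, mainline_url, test_command):
        # Merge a reviewed branch into a fresh copy of mainline, run the full
        # test suite, and only publish the result if everything passes.
        subprocess.check_call(['bzr', 'branch', mainline_url, 'work'])
        subprocess.check_call(['bzr', 'merge', submission_branch], cwd='work')
        try:
            subprocess.check_call(test_command, cwd='work')
        except subprocess.CalledProcessError:
            return False   # tests failed; mainline is left untouched
        subprocess.check_call(
            ['bzr', 'commit', '-m', 'Merge %s' % submission_branch], cwd='work')
        subprocess.check_call(['bzr', 'push', mainline_url], cwd='work')
        return True

    # For example (hypothetical URLs):
    # try_to_land('http://example.com/~dev/feature-branch',
    #             'sftp://pqm@example.com/srv/bzr/project.dev',
    #             ['bzr', 'selftest'])

Because the robot processes one submission at a time, each merge is tested against the mainline tip it will actually land on, which is what avoids the "infinite loop" in point 2.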

Dogfooding

If you want people to do regular testing of the development version, it must be easy to run different versions of the project without needing a complex install. Bazaar does this by being runnable directly out of the source tree, without any need to set $PYTHONPATH or mess around with installing different versions. You can also easily change out the plugins that are loaded using the BZR_PLUGIN_PATH env variable. This means that developers can run the latest development version, and easily switch to a particular version when trying to reproduce a bug or help a user.
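
The general technique is not bzr-specific, and this is not bzr's actual startup code, but the idea is roughly this: the front-end script puts its own source tree at the front of sys.path before importing the library, so whichever checkout you run from is the version you get. The 'myproj'/'myprojlib' names below are hypothetical.

    import os
    import sys

    def prefer_in_tree_library(script_path, package_name):
        # If this script lives in a source checkout that contains the library
        # package, put that checkout first on sys.path so it wins over any
        # system-wide installation.
        source_dir = os.path.dirname(os.path.abspath(script_path))
        if os.path.isdir(os.path.join(source_dir, package_name)):
            sys.path.insert(0, source_dir)

    # A front-end script for a hypothetical project 'myproj' might start with:
    #     prefer_in_tree_library(__file__, 'myprojlib')
    #     import myprojlib
    #     myprojlib.main(sys.argv[1:])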

By having the PQM running the test suite, developers can run on the bleeding edge, and know that they won't get random breakage. It is always possible that something will break, but the chance is quite low. (In 2000 or so commits since we started using PQM, I believe bzr.dev has never been completely unusable, and has had < 5 revisions which we would not recommend people use.)

It also means that you can be fairly confident in creating a release directly from your integration branch (mainline).

Tuesday, March 27, 2007

Test DRIVEN Development

For the Bazaar project we have a general goal that all code should be tested. We have an explicit development workflow that all new code must have associated tests before it can be merged into mainline.

Our latest release (0.15; rc3 is currently out the door, the final release should happen next week) introduces a new disk format, and a plethora of new tests ('bzr selftest' in 0.14 has approx 4400 tests, versus 5900 in 0.15). A lot of these are interface tests. Since we support multiple working tree, branch, and repository formats, we want to make sure that they all work the same way. (So only one test is written, but it may be run against 4 or 5 different formats.)
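
The mechanics of those interface tests are roughly what plain unittest already gives you: write the test once against the common interface, then run it from one concrete TestCase per format. The "formats" below are toy stand-ins, not bzr's real working tree, branch, or repository formats.

    import unittest

    class MemoryStore(object):
        # Toy "format" #1: keeps content in a dict.
        def __init__(self):
            self._data = {}
        def put(self, key, value):
            self._data[key] = value
        def get(self, key):
            return self._data[key]

    class ListStore(object):
        # Toy "format" #2: keeps content as a list of pairs.
        def __init__(self):
            self._data = []
        def put(self, key, value):
            self._data.append((key, value))
        def get(self, key):
            for k, v in reversed(self._data):
                if k == key:
                    return v
            raise KeyError(key)

    class StoreInterfaceTests(object):
        # Written once; exercised against every concrete format below.
        def test_roundtrip(self):
            store = self.make_store()
            store.put('a', 'content')
            self.assertEqual('content', store.get('a'))

    class TestMemoryStore(StoreInterfaceTests, unittest.TestCase):
        def make_store(self):
            return MemoryStore()

    class TestListStore(StoreInterfaceTests, unittest.TestCase):
        def make_store(self):
            return ListStore()

    if __name__ == '__main__':
        unittest.main()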

It means that we have a very good feeling that our code is doing what we expect it to (all of the major developers dogfood on the current mainline). However, it comes at a bit of a cost, in that running the full test suite gets slower and slower.

Further, I personally follow more of a 'test-after-development' approach, and I'm trying to get into the test driven development mindset. I don't know how I feel just yet, but I was reading this. Whether or not you agree with all of it, it makes it pretty clear how different the mindset can be. It goes through several iterations of testing, coding, and refactoring before it ends up anywhere I consider "realistic", and a lot of that comes at the 'refactoring' step, not at the coding step.

I have a glimpse of how it could be useful, as the idea is to have very small iterations, small enough to fit in the 3-5 minute range, so that every 3-5 minutes you have a new test which passes. It means that you frequently have hard-coded defaults, since that is all the tests require at this point. But it might also help you design an interface without worrying about actually implementing everything.
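
For instance (a made-up example, not one from the article), the first couple of iterations might look like this; the hard-coded default is allowed to survive exactly as long as no test demands more.

    import unittest

    # Iteration 1: the test is written first and fails; the simplest thing
    # that makes it pass is a hard-coded return value:
    #
    #     def greeting(name=None):
    #         return 'Hello, world'
    #
    # Iteration 2: a second test forces the hard-coding back out again.

    def greeting(name=None):
        return 'Hello, %s' % (name or 'world')

    class TestGreeting(unittest.TestCase):
        def test_default_greeting(self):
            self.assertEqual('Hello, world', greeting())

        def test_named_greeting(self):
            self.assertEqual('Hello, Ada', greeting('Ada'))

    if __name__ == '__main__':
        unittest.main()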

He also makes comments about keeping a TODO list, which was the part that made the most sense to me. You can't ever write the code fast enough to get all the ideas out of your head, so you keep a TODO so you don't forget, and also so you don't feel like you need to chase down that path right now.

The other points that stuck with me are that most tests should be "unit tests", which by his definition means they are memory-only and very narrow in scope, and that the test suite should be fast to run, because once it gets under a threshold (his comment was around 10 seconds, not minutes) you can actually run all of them, all the time.

And since a development 'chunk' is supposed to be 3-5 minutes, it is pretty important that the test suite only take seconds to run. The 10s mark is actually reasonable, because it is about as long as you would be willing to give to that single task. Any longer and you are going to be context switching (email, more code, IRC, whatever).

The next level of test size that he mentions is an "integration" test; I personally prefer the term "functional" test. The idea is that a "unit" test should be testing the object (unit) under focus and nothing else, versus a functional test that might make use of other objects, disk, database, whatever. And then the top level is an "end-to-end" test, where you do the whole setup, test, and tear down. These have a purpose (like conformance testing, or use case testing), but they really shouldn't be the bulk of your tests; if a problem only shows up there, it means your lower-level tests are incomplete. They are good from a "the customer wants to be able to do 'X', this test shows that we do it the way they want" viewpoint.
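
In unittest terms the distinction is mostly about what a test is allowed to touch. The functions here are hypothetical, but the shape is the usual one: the unit test never leaves memory, while the functional test sets up and tears down a real file on disk.

    import os
    import tempfile
    import unittest

    def count_words(text):
        # Pure, in-memory logic: the natural target of a unit test (hypothetical).
        return len(text.split())

    def count_words_in_file(path):
        # Touches the filesystem, so it belongs in a functional test.
        f = open(path)
        try:
            return count_words(f.read())
        finally:
            f.close()

    class TestCountWordsUnit(unittest.TestCase):
        # Unit test: memory only, no disk, no other objects.
        def test_counts_words(self):
            self.assertEqual(3, count_words('one two three'))

    class TestCountWordsFunctional(unittest.TestCase):
        # Functional test: sets up a real file on disk, then cleans it up.
        def setUp(self):
            fd, self.path = tempfile.mkstemp()
            os.write(fd, b'one two three')
            os.close(fd)

        def tearDown(self):
            os.remove(self.path)

        def test_counts_words_in_file(self):
            self.assertEqual(3, count_words_in_file(self.path))

    if __name__ == '__main__':
        unittest.main()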

I think I would like to try real TDD sometime, just to get the experience of it. I'll probably try it out on my next plugin, or some other small script I write. I have glimpses of how these sorts of things could be great. Often I'm not sure how to proceed while developing because the idea hasn't solidified in my head. One possibility here is "don't worry about it": create a test for what you think you want, stub out what you have to, and get something working.

Of course, the more I read, the more questions spring up. For example, there is a lot of discussion about test frameworks. Python comes with 'unittest', which is based on the general JUnit (or is it SUnit?) framework, where you subclass a TestCase base class and have a setUp(), a tearDown(), and a bunch of test_foo() tests.

But there are also nose and py.test, which both try to overcome unittest's limitations. And through reading about them, I found discussion that Python 3000 will actually have a slightly different default testing library (for sundry technical and political reasons).
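
The difference in flavour is easy to show with the same toy check written both ways (nothing Bazaar-specific here): unittest wants a TestCase subclass with test_* methods, while nose and py.test will also collect plain test_* functions and let you use a bare assert.

    import unittest

    def add(a, b):
        return a + b

    # unittest / JUnit style: a TestCase subclass with assert* helper methods.
    class TestAdd(unittest.TestCase):
        def test_add(self):
            self.assertEqual(5, add(2, 3))

    # nose / py.test style: a plain function and a bare assert are enough.
    def test_add_function_style():
        assert add(2, 3) == 5

    if __name__ == '__main__':
        # unittest.main() only runs the TestCase above; nose and py.test
        # would collect the plain function as well.
        unittest.main()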

And then there is the mock versus stub debate. As near as I can tell, it centers on how to create a unit test when the object under test depends on another object, and on which method is more robust, easier to maintain, and easier to understand. That link lends some interesting thought about Mock objects: that instead of testing the state of objects, you are actually making an explicit assertion that the object being tested will make specific calls on the dependency.
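
A small hand-rolled example (no mock library, and the deposit/store objects are hypothetical, not from Bazaar) shows the contrast: the state-based test runs the code against an in-memory fake and inspects the result, while the interaction-based test asserts exactly which calls were made on the dependency.

    import unittest

    def deposit(account_store, account_id, amount):
        # Function under test: it depends on some account storage object.
        balance = account_store.get_balance(account_id)
        account_store.set_balance(account_id, balance + amount)

    class InMemoryStore(object):
        # A fake/stub: a real, if simplistic, implementation kept in memory.
        def __init__(self, balances):
            self.balances = dict(balances)
        def get_balance(self, account_id):
            return self.balances[account_id]
        def set_balance(self, account_id, amount):
            self.balances[account_id] = amount

    class TestDepositStateStyle(unittest.TestCase):
        # State-based: run the code against a fake and inspect the result.
        def test_deposit_increases_balance(self):
            store = InMemoryStore({'acct': 10})
            deposit(store, 'acct', 5)
            self.assertEqual(15, store.balances['acct'])

    class MockStore(object):
        # Hand-rolled mock: records calls so the test can assert on them.
        def __init__(self):
            self.calls = []
        def get_balance(self, account_id):
            self.calls.append(('get_balance', account_id))
            return 10
        def set_balance(self, account_id, amount):
            self.calls.append(('set_balance', account_id, amount))

    class TestDepositInteractionStyle(unittest.TestCase):
        # Interaction-based: assert the exact calls made on the dependency.
        def test_deposit_makes_expected_calls(self):
            store = MockStore()
            deposit(store, 'acct', 5)
            self.assertEqual([('get_balance', 'acct'),
                              ('set_balance', 'acct', 15)],
                             store.calls)

    if __name__ == '__main__':
        unittest.main()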

I'm not settled on my decision there, because it feels like you are testing an exact implementation, rather than testing the side effect (interface). Some of what I read says "yes, that is what you are doing, and that is the point." I can understand testing side effects. I guess part of it is how comfortable you are with having your test suite evolve. At least some tests need to be there to say that the interface hasn't changed since the previous release (or that a bug hasn't been reintroduced). If that edge case was tested by a particular test, and that test gets refactored, do you have confidence you didn't re-introduce the bug?

I guess you could have specific conventions about what tests are testing the current implementation, versus the overall interface of a function or class. I can understand that you want your test suite to evolve and stay maintainable. But at the other end, it is meant to test that things are conforming to some interface, so if you change the test suite, you are potentially breaking what you meant to maintain.

Maybe it just means you need several tiers of tests, each one less likely to be refactored.