Planet Bazaar

September 01, 2010

Bazaar Developers

farewell, Ian

I am grieved to say that my friend and colleague Ian Clatworthy died last night, after a long and horrible struggle with cancer. He and his wife Geri celebrated their 19th wedding anniversary yesterday before he passed away peacefully in his sleep, at home, with his family.

I’ve known Ian for eleven years and he has worked at Canonical since 2007. He made large contributions to Bazaar, including launching and driving the bzr-explorer project. Even though he had many technical and business achievements, the most remarkable and inspiring thing was what a thoroughly nice man he was. He was determined to change the world for the better, both in software and in how people relate to each other, and he accomplished both. He will be missed, and remembered.

- Martin

[edit: add picture]

Ian Clatworthy 1966-2010



by Martin Pool at September 01, 2010 02:35 AM

August 31, 2010

Bazaar Developers

bugzilla-bzr integration

Max Kanat-Alexander’s new bugzilla-vcs extension (alpha) supports bzr, svn, hg, git, and cvs. Currently it supports linking bugs and commits, and displaying information about commits in Bugzilla on the show_bug page.



by Martin Pool at August 31, 2010 08:27 AM

August 25, 2010

Mark Shuttleworth

Open textbooks to the rescue

Mark Horner is a Fellow at the Shuttleworth Foundation. The model of the Foundation is unusual: we identify interesting change agents, like Mark, who are articulating powerful ideas that seem like they offer a hint of the future, and we fund them to work on those for a year. We also offer them an investment multiplier: if they put their personal money into a project, we multiply that by 10x or more, up to a maximum amount. In short, find good people, back them when they put skin in the game.

Mark’s specialty is open content for education: figuring out how to produce textbooks collaboratively. He’s done amazing work in the past, independently, leading an initiative to produce free high school science textbooks, and has led the acquisition of a full set of textbooks in SA and their publication under an open content licence by the Foundation. Today, he’s been presented with a really awesome opportunity: provide open content to all of SA, with full backing from the department of education.

That’s a huge step forward, putting open content much more at the center of mainstream thinking. In part, this is precipitated by a crisis, the strike action that is affecting many public services like education in South Africa. But it’s nevertheless a valuable opportunity to show how open content can change the dynamic of the rigid world of education.

He needs help, though, to make sure the current drafts of the Maths and Science textbooks are free of typos:

I really need some extremely urgent help, I’ve been approached by national government to try to help make free educational resources available to support education during the current crisis! We have an opportunity to distribute free educational resources to all schools that cover:

  • Grade R – 9 for ALL learning areas in English and Afrikaans
  • Grade 10 – 12 Mathematics
  • Grade 10 – 12 Physical Science

All that is required is another edit of the Free High School Science Texts before they will release them to all the schools in South Africa. We have ONE WEEK to complete this process and desperately need volunteers who have post-graduate degrees in Maths, Physics, Chemistry or related fields that can help out.

So, if you’re inclined, he has details on how to help. For the moment, looks like participation requires being present in Cape Town, but perhaps he has a solution for that too.

by mark at August 25, 2010 11:54 AM

August 20, 2010

Mark Shuttleworth

10.10.10.10.10…..

Saw this URL fly by today… wow and thank you to the Ubuntu Ads guys :-)

So, who’s up for making Maverick Movies? It would be great to have a “10 best features in 10.10” video collection for release. Unity’s awesome and then there are things to show off in OO.o, Gnome, Firefox…. giving credit where it’s due.

I put together https://wiki.ubuntu.com/MaverickMovies as a starting place to aggregate content. Have subscribed, so if you update that page I’ll see it. If that goes nicely, we can beef the process up in the runup to release.

by mark at August 20, 2010 09:59 AM

August 17, 2010

Mark Shuttleworth

N-imal?

Oh yes, it’s that time of year again, when numerate pollsters make nasal proclamations about the naming of the next next version of Ubuntu. When gazers of balls crystal provide nifty suggestions for new new features and, of course, suitable nomenclature to match.

What will it be? A Naiant Nailtail would make a fine coat of arms, but we’re not really in the business of arms. Most of our businesses have legs. Most, I say. We could hedge our bets and go with the Neutral Newt, but it’s placing bets and seeing them through that raises the game for the free software desktop, and now’s a time of great change and invention, not a time for fence-sitting.

I’ve been procrastinating. The N-evitable nature of our cadence means that calls for something nicer than “Maverick+1” are increasingly noticeable. Naively, I always assume that the answer will leap off the page. Instead, what leaps off the page is a gazillion permutations and combinations of nubile, naughty, naiad and nymph. Moving swiftly onward I linger on the possibilities of the Numbat. Nah. There’s no doubt Fourecks can be a rich source of inspiration, but now’s not the time to celebrate Van Diemen’s Land, we’ve better plans for that. And speaking of Fourecks, the Nobby Noctule sounds like something dreamed up by Terry Pratchett, perhaps a fitting way to move beyond Adams’s 10.10.10, but it really is hard to sing the praises of a bat. Especially one with (k)nobs.

As you can imagine, after a few weeks with a dictionary and colouring in book of animals, I could draw this out N-definitely. The problem is NP-complete, which I’m now reliably informed by the good folks at HP means it’s provably quite difficult and not something that can be delegated to chips of the non-quantum kind. My chips are most definitely non-quantum though my bugs, strangely, are.

Where did that leave us?

Well, let’s look at what we want to get done.

We have this whole design thing in full flow, which is making Ubuntu sleeker and more stylish, as well as making it smoother for those who just want to get stuff done. We’ll make the N release the best-dressed ever. But classy covers don’t equate to good reads – we want style and substance to meet and get along famously. Once Maverick is out the door we’ll be turning our attention to making the most of the amazing capabilities of modern graphics hardware, both for outer beauty and for inner efficiency. There’s a lot more to GL than glitz and glamour, though we won’t say no to either.

We’re also putting a lot of work into chips and architectures (admittedly, not yet of the quantum sort) that keep cool, and help keep the planet cool in the process. So it would be nice to have a codename which reflects that goodness. Some sort of mascot for a cool planet would do the trick.

And so, we come swiftly to a conclusion: allow me to introduce the Natty Narwhal, our mascot for development work that we expect to deliver as Ubuntu 11.04.

The Narwhal, as an Arctic (and somewhat endangered) animal, is a fitting reminder of the fact that we have only one spaceship that can host all of humanity (trust me, a Soyuz won’t do for the long haul to Alpha Centauri). And Ubuntu is all about bringing the generosity of all contributors in this functional commons of code to the widest possible audience, it’s about treating one another with respect, and it’s about being aware of the complexity and diversity of the ecosystems which feed us, clothe us and keep us healthy. Being a natty narwhal, of course, means we have some obligation to put our best foot forward. First impressions count, lasting impressions count more, so let’s make both and make them favourable.

While it may not in fact get you a pony, the world of free software is the platform upon which the future is being built. So the Narwhal, as the closest thing to a real live unicorn, is an auspicious figurehead as we lay down the fabric from which dreams will be woven. Dreams of someone’s first PC, dreams of someone’s first million instances in the cloud: whatever your vision of the future, we hope the Natty Narwhal will have something to offer. Test your gems against that unicorn – some will be glass, others of value. Perhaps the unicorn will bring you Luck, perhaps a cure for poisons proprietary. One thing is certain: we’ll be building it together with thousands of the most generous, insightful, fun people on the planet – not only those in the Ubuntu community, but those who participate in the whole of the free software ecosystem, from a2jmidid to zzliplib, with Debian (happy Birthday!, now longer in the tooth, wiser, but as potent and principled as ever) a special partner. I’m looking forward to the ride, and the result!

by mark at August 17, 2010 06:31 PM

August 16, 2010

Mark Shuttleworth

Gestures with multitouch in Ubuntu 10.10

Multitouch is just as useful on a desktop as it is on a phone or tablet, so I’m delighted that the first cut of Canonical’s UTouch framework has landed in Maverick and will be there for its release on 10.10.10.

You’ll need 4-finger touch or better to get the most out of it, and we’re currently targeting the Dell XT2 as a development environment so the lucky folks with that machine will get the best results today. By release, we expect you’ll be able to use it with a range of devices from major manufacturers, and with addons like Apple’s Magic Trackpad.

The design team has led the way, developing a “touch language” which goes beyond the work that we’ve seen elsewhere. Rather than single, magic gestures, we’re making it possible for basic gestures to be chained, or composed, into more sophisticated “sentences”. The basic gestures, or primitives, are like individual verbs, and stringing them together allows for richer interactions. It’s not quite the difference between banging rocks together and conducting a symphony orchestra, but it feels like a good step in the right direction ;-)

The new underlying code is published on Launchpad under the GPLv3 and LGPLv3, and of course there are quite a lot of modules for things like X and Gtk which may be under licenses preferred by those projects. There’s a PPA if you’re interested in tracking the cutting edge, or just branch / push / merge on LP if you want to make it better. Details in the official developer announcement. The bits depend on Peter Hutterer’s recently published update to the X input protocols related to multi-touch, and add gesture processing and gesture event delivery. I’d like to thank Duncan McGreggor for his leadership of the team which implemented this design, and of course all the folks who have worked on it so far: Henrik Rydberg, Rafi Rubin, Chase Douglas, Stephen Webb at the heart of it, and many others who have expanded on their efforts.

In Maverick, quite a few Gtk applications will support gesture-based scrolling. We’ll enhance Evince to show some of the richer interactions that developers might want to add to their apps. Window management will be gesture-enabled in Unity, so 10.10 Netbook Edition users with touch screens or multi-touch pads will have sophisticated window management at their fingertips. Install Unity on your desktop for a taste of it, just apt-get install ubuntu-netbook and choose the appropriate session at login.

The roadmap beyond 10.10 will flesh out the app developer API and provide system services related to gesture processing and touch. It would be awesome to have touch-aware versions of all the major apps – browser, email, file management, chat, photo management and media playback – for 11.04, but that depends on you! So if you are interested in this, let’s work up some branches :-) Here’s the official Canonical blog post, too.

by mark at August 16, 2010 12:50 PM

August 14, 2010

Jonathan Lange

unittest, part 3

So far, we've talked about TestSuites, TestCases and TestResults. We've seen how these objects interact with each other and how they can generally be thought about as having more than one interface. TestResult has an interface for the TestCase and an interface used for querying the results, TestCase has an interface for test runners and an interface for test authors, and so forth.

Now we need to give some time to the bits that glue everything together: the test runner and the test loader.

TestRunner

You will not find a class in unittest.py called TestRunner. A test runner is simply something that takes user input about a test run – what tests to run, what manner to run them in, how to display the results – and does it.

Essentially, it does something like this:
    test = TestLoader().loadTests(user_specified_test_string)
    result = makeTestResult(options_specified_by_user)
    result.startTestRun()
    try:
        test.run(result)
    finally:
        result.stopTestRun()

And that's it.

You see that the test runner is responsible for instantiating the test loader and the test result. It's perhaps excusable for a test runner to be tightly bound to particular implementations of test loader and test result. Certainly, before TestResult grew startTestRun and stopTestRun it was inevitable: since the test runner was responsible for summarizing the results of a test run, overall responsibility for displaying the results was split between the runner and the result.

Nowadays, the tight coupling can be limited. If your test runner has an option to display stack traces as it gets them, then that's pretty much going to force you to use a particular result. However, you can still write your code internally such that someone could pass in a different result that still works, even though it doesn't do exactly what the user asked for.
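As a rough, runnable version of the loop sketched earlier, here is one way it might look with the standard library's unittest on Python 3. The `run_tests` helper and `ExampleTest` class are made up for illustration, and `loadTestsFromTestCase` stands in for the post's generic `loadTests`. Taking the result as a parameter is what keeps the runner from being bound to one result class.

```python
import unittest


class ExampleTest(unittest.TestCase):
    """A stand-in test case so the sketch has something to run."""

    def test_addition(self):
        self.assertEqual(1 + 1, 2)

    def test_upper(self):
        self.assertEqual("bzr".upper(), "BZR")


def run_tests(test_case_class, result=None):
    """Load the tests, run them against ``result`` and return the result.

    Any object implementing the TestResult interface can be passed in;
    a plain unittest.TestResult is only the default.
    """
    test = unittest.TestLoader().loadTestsFromTestCase(test_case_class)
    if result is None:
        result = unittest.TestResult()
    result.startTestRun()
    try:
        test.run(result)
    finally:
        # Always give the result a chance to summarize, even on errors.
        result.stopTestRun()
    return result


result = run_tests(ExampleTest)
print(result.testsRun, result.wasSuccessful())
```

A caller who wants different reporting can pass `unittest.TextTestResult` or a custom subclass without touching `run_tests`.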

TestLoader

From the point of view of interfaces and compatibility, this is a pretty boring class, and that's a good thing. The test loader's job is to find tests based on some user input and construct a single ITest object for them.

When it does more than this, one runs the risk of having the behaviour of a test suite depend too much on the runner itself. The ideal is to have the test suite run in any runner: trial, nose, unittest2, py.test, whatever.

Some TestLoaders provide hooks so that users with complicated test suites can customize the way their tests are loaded. Whenever the Trial TestLoader sees a test_suite() function in a module, it lets that function take charge of the loading.

The standard library in 2.7 has a new hook, inspired by an innovation in bzrlib, but slightly different. load_tests(loader, standard_tests, pattern) is given the loader used by the test runner, the tests that the loader would have loaded, and if appropriate, a glob used for matching test module files. The advantage of this hook is that it reduces the danger of customizations made to the loader, since the test suite has access to the same loader. It also makes custom loading easier by giving the standard tests as a starting point. bzrlib uses this to run the same set of tests against many implementations.
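To make the hook concrete, here is a small sketch of a module that uses `load_tests` to run one set of tests against two implementations, loosely in the spirit of what bzrlib does. The `RoundTripMixin` class and the two sort functions are invented for the example, and this is not bzrlib's actual mechanism.

```python
import unittest


class RoundTripMixin(object):
    """Tests to run once per implementation (hypothetical example).

    This is deliberately not a TestCase, so the default loader finds
    nothing here; load_tests() below builds the concrete classes.
    """

    implementation = None  # set on each generated subclass

    def test_sorts(self):
        self.assertEqual(self.implementation([3, 1, 2]), [1, 2, 3])


def sort_builtin(items):
    return sorted(items)


def sort_in_place(items):
    copy = list(items)
    copy.sort()
    return copy


def load_tests(loader, standard_tests, pattern):
    """Start from the standard tests and add one copy per implementation."""
    suite = unittest.TestSuite()
    suite.addTests(standard_tests)
    for impl in (sort_builtin, sort_in_place):
        # Build a TestCase subclass on the fly for this implementation.
        scenario = type(
            "RoundTrip_" + impl.__name__,
            (RoundTripMixin, unittest.TestCase),
            {"implementation": staticmethod(impl)},
        )
        # Use the loader we were handed, not a freshly created one.
        suite.addTests(loader.loadTestsFromTestCase(scenario))
    return suite


# Calling the hook by hand, the way a loader would:
suite = load_tests(unittest.TestLoader(), unittest.TestSuite(), None)
result = unittest.TestResult()
suite.run(result)
print(result.testsRun, result.wasSuccessful())
```

Because the hook receives the runner's own loader, any customization the runner made to that loader still applies to the generated tests.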

I think that's all I have to say about these two, which means that's pretty much all I have to say about unittest's API for test frameworks. Still one more post to go though: interfaces for test authors.

Let me know if I've missed anything, if anything here surprises you or contradicts something I said in the past or if things are unclear. The comments on the previous two posts have really helped!

by jml (noreply@blogger.com) at August 14, 2010 04:34 PM

August 04, 2010

John Arbash Meinel

Step-by-step Meliae

Some people asked me to provide a step-by-step guide to how to debug memory using Meliae. I just ran into another unknown situation, so I figured I'd post a step-by-step along with rationale for why I'm doing it.
  1. First is loading up the data. This was a dump while running 'bzr pack' of a large repository.
    >>> from meliae import loader
    >>> om = loader.load('big.dump')
    >>> om.remove_expensive_references()
    The last step is done because otherwise instances keep a reference to their class, and classes reference their base types, and you end up getting to 'object', and somewhere along the way you end up referencing too much. I don't do it automatically, because it does remove actual references, which someone might want to keep.
  2. Then, do a big summary, just to get started
    >>> om.summarize()
    Total 8364538 objects, 286 types, Total size = 440.4MiB (461765737 bytes)
    Index Count % Size % Cum Max Kind
    0 2193778 26 181553569 39 39 4194281 str
    1 12519 0 97231956 21 60 12583052 dict
    2 1599439 19 68293428 14 75 304 tuple
    3 3459765 41 62169616 13 88 20 bzrlib._static_tuple_c.StaticTuple
    4 82 0 29372712 6 94 8388724 set
    5 1052573 12 12630876 2 97 12 int
    6 1644 0 4693700 1 98 2351848 list
    7 4038 0 2245128 0 99 556 _LazyGroupCompressFactory
  3. You can see that
    1. There are 8M objects, and about 440MB of reachable memory.
    2. The vast bulk of that is in strings, but there are also some oddities, like that 12.5MB dictionary
  4. At this point, I wanted to understand what was up with that big dictionary.
    >>> dicts = om.get_all('dict')
    >>> dicts[0]
    dict(417338688 12583052B 1045240refs 2par)
    om.get_all() gives you a list of all objects matching the given type string. It also sorts the returned list, so that the biggest items are at the beginning.
  5. Now let's look around a bit, to try to figure out where this dict lives
    >>> bigd = dicts[0]
    >>> from pprint import pprint as pp
    We'll use pprint a lot, so map it to something easy to type.
    >>> pp(bigd.p)
    [frame(39600120 464B 23refs 1par '_get_remaining_record_stream'),
    _BatchingBlockFetcher(180042960 556B 17refs 3par)]
  6. So this dict is contained in a frame, but also an attribute of _BatchingBlockFetcher. Let's try to see which attribute it is.
    >>> pp(bigd.p[1].refs_as_dict())
    {'batch_memos': dict(584888016 140B 4refs 1par),
    'gcvf': GroupCompressVersionedFiles(571002736 556B 13refs 9par),
    'keys': list(186984208 16968B 4038refs 2par),
    'last_read_memo': tuple(536280880 40B 3refs 1par),
    'locations': dict(417338688 12583052B 1045240refs 2par),
    'manager': _LazyGroupContentManager(584077552 172B 7refs 3716par),
    'memos_to_get': list(186983248 52B 1refs 2par),
    'total_bytes': 774119}
  7. It takes a bit to look through that, but you can see:
    'locations': dict(417338688 12583052B 1045240refs 2par)
    Note that 1045240refs means there are 522k key:value pairs in this dict.
  8. How much total memory is this dict referencing?
    >>> om.summarize(bigd)
    Total 4035636 objects, 22 types, Total size = 136.8MiB (143461221 bytes)
    Index Count % Size % Cum Max Kind
    0 1567864 38 66895512 46 46 52 tuple
    1 285704 7 24972909 17 64 226 str
    2 1142424 28 20757800 14 78 20 bzrlib._static_tuple_c.StaticTuple
    ...
    8 2 0 1832 0 99 1684 FIFOCache
    9 35 0 1120 0 99 32 _InternalNode
  9. So about 136MB out of 440MB is reachable from this dict. However, I'm noticing that FIFOCache and _InternalNode are also reachable, and those don't really seem to fit. I also notice that there are 1.6M tuples here, which is often a no-no. (If we are going to have that many tuples, we probably want them to be StaticTuple() because they use a fair amount less memory, can be interned, and aren't in the garbage collector.) So let's poke around a little bit
    >>> bigd[0]
    bzrlib._static_tuple_c.StaticTuple(408433296 20B 2refs 9par)
    >>> bigd[1]
    tuple(618390272 44B 4refs 1par)
    >>> pp(bigd[0].c)
    [str(40127328 80B 473par 'svn-v4:138bc75d-0d04-0410-961f-82ee72b054a4:trunk:126948'),
    str(247098672 85B 37par '14@138bc75d-0d04-0410-961f-82ee72b054a4:trunk%2Fgcc%2Finput.h')]
    >>> pp(bigd[1].c)
    [tuple(618383880 36B 2refs 1par),
    bzrlib._static_tuple_c.StaticTuple(569848240 16B 1refs 3par),
    NoneType(505223636 8B 1074389par),
    tuple(618390416 48B 5refs 1par)]
    One thing to note, dict references are [key1, value1, key2, value2] while tuple references are (last, middle, first). I don't know why tuple.tp_traverse traverses in reverse order, but it does. And StaticTuple followed its lead.
    The things to take away from this are
    1. It is mapping a StaticTuple(file_id, revision_id) => tuple()
    2. The target tuple is actually quite complex, so we'll have to dig a bit deeper to figure it out.
    3. The file-id and revision-id are both referenced many times (37 and 473 respectively), so we seem to be doing a decent job sharing those strings.
  10. At this point, I would probably pull up the source code for _BatchingBlockFetcher, to try and figure out what is so big for locations. Looking at the source code, it is actually built in _get_remaining_record_stream as:
    locations = self._index.get_build_details(keys)
    This is then defined as returning:
      :return: A dict of key: (index_memo, compression_parent, parents, record_details).
  11. And the index memo contains a reference to the indexes themselves, but they don't really 'own' them. So let's filter them out:
    >>> indexes = om.get_all('BTreeGraphIndex')
    >>> om.summarize(bigd, excluding=[o.address for o in indexes])
    Total 3740667 objects, 6 types, Total size = 122.9MiB (128855911 bytes)
    Index Count % Size % Cum Max Kind
    0 1567860 41 66895360 51 51 48 tuple
    1 189162 5 19690647 15 67 226 str
    2 948160 25 17261048 13 80 20 bzrlib._static_tuple_c.StaticTuple
    3 1 0 12583052 9 90 12583052 dict
    4 1035483 27 12425796 9 99 12 int
    5 1 0 8 0 100 8 NoneType
    (It is currently a bit clumsy that you have to do [o.address], but it means you can use large sets of ints. I'm still trying to sort that out.)
    The memory consumption here looks more realistic. You can also see that just the tuple objects by themselves consume 67MB, or 51% of the memory. You can also see that for a dict holding 500k entries, we have 1.5M tuples. So we are using 3 tuples per key.
  12. Note that we can't just use StaticTuple here, because index_memo[0] is the BTreeGraphIndex. Digging into the code, I think the data is all here:
    result[key] = (self._node_to_position(entry),
                   None, parents, (method, None))
    You can see that there is a whole lot of 'None' in this, and we also have an extra tuple at the end which is a bit of a waste (vs just inlining the content). We could save 28 bytes/record (or 28*500k = 14MB) by just inlining that last (method, None). Though it changes some apis.
  13. Another thing to notice is that if you grep through the source code for uses of 'locations', you can see that we use the parents info and the index_memo, but we just ignore everything else. (method, compression_parent, and eol info are never interesting here). So really the result could be:
    result[key] = (self._node_to_position(entry), parents)
    This would be 28 + 4*2 = 36 vs (28+4*4 + 28+4*2) = 80, or saving 44 bytes/record * 0.5M = 22MB. That is about 20% of that 122MB. Which isn't huge, but isn't a lot of effort to get. We could get a little better if we could collapse the node_to_position info alongside the parents info, etc. (Say with a custom object.) That could shave another 28 bytes for the tuple, and maybe one extra reference.
  14. I ended up working on this, because it was like a 10 minute thing. I ended up creating this class (code at lp:
    class _GCBuildDetails(object):
        """A blob of data about the build details.

        This stores the minimal data, which then allows compatibility with the old
        api, without taking as much memory.
        """

        __slots__ = ('_index', '_group_start', '_group_end', '_basis_end',
                     '_delta_end', '_parents')

        method = 'group'
        compression_parent = None

        def __init__(self, parents, position_info):
            self._parents = parents
            self._index = position_info[0]
            self._group_start = position_info[1]
            # Is this _end or length? Doesn't really matter to us
            self._group_end = position_info[2]
            self._basis_end = position_info[3]
            self._delta_end = position_info[4]

        def __repr__(self):
            return '%s(%s, %s)' % (self.__class__.__name__,
                                   self.index_memo, self._parents)

        @property
        def index_memo(self):
            return (self._index, self._group_start, self._group_end,
                    self._basis_end, self._delta_end)

        @property
        def record_details(self):
            return static_tuple.StaticTuple(self.method, None)

        def __getitem__(self, offset):
            """Compatibility thunk to act like a tuple."""
            if offset == 0:
                return self.index_memo
            elif offset == 1:
                return self.compression_parent  # Always None
            elif offset == 2:
                return self._parents
            elif offset == 3:
                return self.record_details
            else:
                raise IndexError('offset out of range')

        def __len__(self):
            return 4
  15. The size of this class is 48 bytes, including the python object and gc overhead. This replaces the tuple(index_memo_tuple(index, start, end, start, end), None, parents, tuple(method, None)), which is 28+4*4 + 28+4*5 + 28+4*2 = 128 bytes. So we save 80 bytes per record. On my bzr.dev repository that is ~10.6MB; on this dump it would be 40MB.
  16. The other bit to look at is measuring real-world results. Which looks something like this:
    >>> from bzrlib import branch, trace, initialize; initialize().__enter__()

    >>> b = branch.Branch.open('.')
    >>> b.lock_read()
    LogicalLockResult(
    /2.3-gc-build-details/)>)
    >>> keys = b.repository.texts.keys()
    >>> trace.debug_memory('holding all keys')
    WorkingSize 33192KiB PeakWorking 34772KiB holding all keys
    >>> locations = b.repository.texts._index.get_build_details(keys)
    >>> trace.debug_memory('holding all keys')
    WorkingSize 77604KiB PeakWorking 87960KiB holding all keys
    >>>
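The byte counts in steps 13 and 15 all follow from one simple model: on the 32-bit Python 2 build analysed here, a tuple costs 28 bytes plus 4 bytes per item. The sketch below just replays that arithmetic. The `tuple_size` helper is made up for illustration and the 500k record count is the rounded figure used in the steps above.

```python
# Size model taken from the walkthrough (32-bit CPython 2.x assumed):
# 28 bytes of tuple overhead plus one 4-byte pointer per item.
TUPLE_OVERHEAD = 28
POINTER_SIZE = 4
RECORDS = 500000  # rounded number of entries in the 'locations' dict


def tuple_size(n_items):
    """Approximate size in bytes of a tuple holding n_items references."""
    return TUPLE_OVERHEAD + POINTER_SIZE * n_items


# Original record: (index_memo, None, parents, (method, None))
outer = tuple_size(4)
index_memo = tuple_size(5)
record_details = tuple_size(2)

# Step 13: (index_memo, parents) instead of the outer tuple + (method, None).
step13_saving = (outer + record_details) - tuple_size(2)

# Step 15: a 48-byte __slots__ object replaces all three tuples.
step15_saving = (outer + index_memo + record_details) - 48

print(step13_saving, step13_saving * RECORDS)
print(step15_saving, step15_saving * RECORDS)
```

On a 64-bit interpreter the constants differ, so the absolute savings would too, but the comparison is set up the same way.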
Hopefully this has been informative: digging into a bit of memory consumption, how to determine where memory is being consumed, and a bit of understanding about how you can rework python objects to save a bit of memory. The biggest thing is to try to use fewer objects overall, since every object is at least 24 bytes, and that is if you are using __slots__. If you aren't, then it is a minimum of 172 bytes (32 for the base object + 140 for its __dict__).
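The __slots__ effect is easy to check on whatever interpreter you have. The sketch below assumes CPython (it relies on `sys.getsizeof`) and uses two made-up record classes. The absolute numbers on a modern 64-bit Python 3 will not match the 24 and 172 byte figures quoted above, which are for a 32-bit Python 2 build.

```python
import sys


class PlainRecord:
    """Regular class: every instance carries a per-instance __dict__."""

    def __init__(self, parents, index):
        self.parents = parents
        self.index = index


class SlottedRecord:
    """Same fields, stored in fixed slots with no per-instance __dict__."""

    __slots__ = ("parents", "index")

    def __init__(self, parents, index):
        self.parents = parents
        self.index = index


plain = PlainRecord((), 0)
slotted = SlottedRecord((), 0)

# getsizeof() does not follow references, so add the __dict__ by hand.
plain_total = sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
slotted_total = sys.getsizeof(slotted)

print("plain:", plain_total, "bytes; slotted:", slotted_total, "bytes")
```

The trade-off is that a slotted instance cannot grow new attributes, which is usually fine for small record-like objects that exist in the hundreds of thousands.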

by noreply@blogger.com (jam) at August 04, 2010 09:50 PM

Mark Shuttleworth

Making room in the sound indicator

In Maverick we’re adding the new Ayatana indicator for sound, Conor Curran’s very classy implementation of MPT’s very classy spec. It’s a Category Indicator, like the messaging menu, so it allows apps to embed themselves into it in a standard and appropriate way. You can have multiple players represented there, and control them directly from the menu, without needing a custom AppIndicator or windows open for the player(s). The integration with Rhythmbox and, via the MPRIS dbus API, several other players is coming along steadily.

One issue I’ve noticed is that the layout of the track and album art means we are almost always ellipsizing some of the track / album / artist data. I wondered whether it wouldn’t be reasonable to lay the metadata over the album art, if one used a drop shadow to ensure a more readable text:

Here’s the current layout:

Note the tight space for the track data, and hence the ellipsis

And here’s a GIMPfication, showing:

– the metadata right aligned,
– allowed to flow over the album art
– with a drop shadow to preserve contrast with the artwork

Metadata stretched over the artwork, with a drop shadow

And finally, I was a bit worried about the drop shadow over the non-art portion of the menu. It’s too different to anything else in the menu, so I cropped the shadows to limit them just to the area over the art:

Drop shadow is only used on the artwork

Clearly, this is only appropriate in the case where one has artwork. The metadata should stay left-aligned (and use the full width of the menu, something it doesn’t currently do) when there is no artwork.

Thoughts? I’m off to bed. Jetlagged, back from Debconf (lovely to see everyone again, if briefly).

by mark at August 04, 2010 01:00 AM

August 02, 2010

John Arbash Meinel

Meliae 0.3.0, statistics on subsets

Ah, yet another release. Hopefully with genuinely useful functionality.

In the process of inspecting yet another unexpected memory consumption, I came across a potential solution to the reference cycles problem.

Specifically, the issue is that often (at least in our codebases) you have coupled classes that end up in a cycle, and you have trouble determining who "owns" what memory. In our case, the objects tend to be only 'loosely' coupled, in that one class passes a reference to a bound method off to another object. However, a bound method holds a reference to the original object, so you get a cycle. (For example, Repository passes its 'is_locked()' function down to the VersionedFiles so that they know whether it is safe to cache information. Repository "owns" the VersionedFiles, but they end up holding a reference back.)
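A stripped-down illustration of that cycle, using toy stand-ins for Repository and VersionedFiles (these are not the bzrlib classes, just the same shape):

```python
class VersionedFiles:
    """Toy stand-in: keeps a callback to ask whether caching is safe."""

    def __init__(self, is_locked):
        self._is_locked = is_locked


class Repository:
    """Toy stand-in: owns a VersionedFiles and hands it a bound method."""

    def __init__(self):
        self._locked = False
        # self.is_locked is a bound method, so it carries a reference to self.
        self.texts = VersionedFiles(self.is_locked)

    def is_locked(self):
        return self._locked


repo = Repository()
callback = repo.texts._is_locked

# repo -> texts -> bound method -> repo: the "owned" object points back.
print(callback.__self__ is repo)
```

Anything that walks references starting from `repo.texts` therefore reaches the whole Repository, which is why a plain "what does this object reference" summary attributes nearly everything to nearly every object.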

What turned out to be useful was just adding an exclusion list to most operations. This ends up letting you find out about stuff that is referenced by object1, but is not referenced inside a specific subset.

One of the more interesting apis is the existing ObjManager.summarize().

So you can now do stuff like:
>>> from meliae import loader
>>> om = loader.load('my.dump')
>>> om.summarize()
Total 5078730 objects, 290 types, Total size = 367.4MiB (385233882 bytes)
Index Count % Size % Cum Max Kind
0 2375950 46 224148214 58 58 4194313 str
1 63209 1 77855404 20 78 3145868 dict
2 1647097 32 29645488 7 86 20 bzrlib._static_tuple_c.StaticTuple
3 374259 7 14852532 3 89 304 tuple
4 138464 2 12387988 3 93 536 unicode
...

You can see that there are a lot of strings and dicts referenced here, but who owns them? Tracking into the references and using om.compute_total_size() just seems to get a lot of objects that reference everything. For example:
>>> dirstate = om.get_all('DirState')[0]
>>> om.summarize(dirstate)
Total 5025919 objects, 242 types, Total size = 362.0MiB (379541089 bytes)
Index Count % Size % Cum Max Kind
0 2355265 46 223321197 58 58 4194313 str
...

Now that did filter out a couple of objects, but when you track the graph, it turns out that DirState refers back to its WorkingTree, and WT has a Branch, which has the Repository, which has all the actual content. So what is actually referred to by just DirState?
>>> from pprint import pprint as pp
>>> pp(dirstate.refs_as_dict())
{'_bisect_page_size': 4096,
...
'_sha1_file': instancemethod(34050336 40B 3refs 1par),
'_sha1_provider': ContentFilterAwareSHA1Provider(41157008 172B 3refs 2par),
...
'crc_expected': -1471338016}
>>> pp(om[41157008].c)
[str(30677664 28B 265par 'tree'),
WorkingTree6(41157168 556B 35refs 7par),
type(39222976 452B 4refs 4par 'ContentFilterAwareSHA1Provider')]
>>> wt = om[41157168]
>>> om.summarize(dirstate, excluding=[wt.address])
Total 5025896 objects, 238 types, Total size = 362.0MiB (379539040 bytes)


Oops, I forgot an important step. Instances refer back to their type, and new-style classes keep an MRO reference all the way back to object, which ends up referring to the whole dataset.
>>> om.remove_expensive_references()
removed 1906 expensive refs from 5078730 objs

Note that it doesn't take many references (just 2k out of 5M objects) to cause these problems.
>>> om.summarize(dirstate, excluding=[wt.address])
Total 699709 objects, 19 types, Total size = 42.2MiB (44239684 bytes)
Index Count % Size % Cum Max Kind
0 285690 40 20997620 47 47 226 str
1 212977 30 8781420 19 67 48 tuple
2 69640 9 8078240 18 85 116 set
...

And there you see that we have only 42MB that is directly referenced from DirState. (Still more than I would like, but at least it is useful data, rather than just saying it references all objects.)
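
The effect of those two steps can be sketched on a toy object graph. This is not how the memory analysis tool is implemented; the graph, the sizes, and the helper reachable_size are invented for illustration. It only shows why excluding the WorkingTree is not enough while a type still points at everything, and why dropping a handful of references changes the total so much.

```python
from collections import deque


def reachable_size(graph, sizes, start, excluding=()):
    """Return (object count, total bytes) reachable from start.

    Nodes listed in `excluding` are never visited, so anything that can
    only be reached through them is left out of the total.
    """
    seen = set(excluding)
    seen.add(start)
    queue = deque([start])
    count = total = 0
    while queue:
        node = queue.popleft()
        count += 1
        total += sizes[node]
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return count, total


# Invented toy graph: names and sizes are placeholders, not real data.
graph = {
    "dirstate": ["entries", "sha1_provider", "DirState_type"],
    "sha1_provider": ["working_tree"],
    "working_tree": ["repository"],
    "repository": ["content"],
    "DirState_type": ["object_type"],
    # Stand-in for the "expensive" references that reach the whole dataset.
    "object_type": ["content"],
}
sizes = {
    "dirstate": 100, "entries": 4000, "sha1_provider": 172,
    "working_tree": 556, "repository": 500, "content": 300000,
    "DirState_type": 452, "object_type": 400,
}

# Excluding the working tree alone still reaches "content" via the type.
before = reachable_size(graph, sizes, "dirstate", excluding=["working_tree"])

# Drop the expensive references, then measure again.
pruned = dict(graph, object_type=[])
after = reachable_size(pruned, sizes, "dirstate", excluding=["working_tree"])
```

In this toy graph a single pruned edge is the difference between counting the repository content and counting only what the dirstate itself holds.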

I'm not 100% satisfied with the interface. Right now it takes an iterable of integer addresses, which is often good because those integers are small and shared, so the only cost is the actual list. Taking objects requires creating the Python proxy objects, which is something I'm avoiding because it actually requires a lot of memory to do so. (Analyzing 10M objects takes 1.1GB of peak RAM, 780MB sustained.)

by noreply@blogger.com (jam) at August 02, 2010 05:47 PM

Mark Shuttleworth

Healing old wounds

Greg, thank you for your sincere and gracious apology.

When one cares deeply about something, criticism hurts so much more. And the free software world is loaded with caring, which is why our differences can so easily become vitriolic.

All of us that work on free software share the belief that our work has meaning far beyond the actual technology we produce. We are working to achieve goals that transcend the merits of the specific products we build: putting software freedom on a firm economic footing means that it can realistically become the de facto standard way that the software world works, carried forward by powerful forces of investment and return and less dependent on what feels like the heroic efforts of relatively few software outsiders swimming against the tide.

Red Hat’s success in proving a viable business model around a distribution was a very significant milestone in that quest, for all of us. I don’t mean to diminish that achievement when I point out that it’s come at the cost of dividing the world into those that buy RHEL, and those that can’t or won’t. Red Hat’s success is well deserved, and our work at Canonical is not in any sense motivated by desire to take that away. Red Hat is here to stay, there will always be a market for the product, and as a result, we all have the reassurance that our contributions can find a sustainable path into the hands of at least part of the world’s population.

Canonical’s mission is to expand the options, to find out if it’s possible to have a sustainable platform without that dividing line. We know that our quest would not be possible without your pioneering, but we don’t feel that’s riding on anybody’s coat-tails. We feel we have to break new ground, do new things, add new ingredients, and all of that is a substantial contribution in turn. But we don’t do it because we think Red Hat is “wrong”, and we don’t expect it to take anything away from Red Hat at all. We do it to add to the options, not to replace them.

We should start every discussion in free software with a mutual reminder of the fact that we have far more in common than we have differences, that individual successes enrich all of us far more in our open commons-based economy than they would in a traditional proprietary one, that it’s better for us to find a way to encourage others to continue to participate even if they aren’t necessarily chasing exactly the same bugs that we are, than to chastise them for thinking differently.

On that note, let’s shake hands.

Mark

by mark at August 02, 2010 01:26 PM

Jonathan Lange

unittest API, part 2

In part 1 of this humble attempt to document the interfaces and contracts that unittest actually cares about, we talked about TestSuite and TestCase, how they both implement a common interface that's used for running tests, ITest, and how they each implement their own interfaces, ITestSuite and ITestCase.

Now we're moving on to a much more complicated object, TestResult, to see how we can pick apart the ways it interacts with the rest of the system.

TestResult

A TestResult object is all about dealing with the results of tests, as you might expect. However, it doesn't generally represent a single test result. You could say it represents the results of a number of tests, but I don't think that's terribly helpful.

Better to think of a TestResult object as an event handler. A TestResult object receives events from a test run and then does something with them.

Just as TestCase has a two-faced nature, presenting one interface to the testing framework and another to test authors, so too TestResult can be thought of as having many interfaces:
  1. Its interface to a TestCase. This can be thought of as the test event handling interface
  2. A result querying interface, normally used by a test runner
  3. An interface for events that come from the test runner, the runner event handling interface.
  4. An execution control interface.
Note that the result querying interface and the runner event handling interface together make up the interface between the TestResult and test runner.

Let's start with the test event handling interface. The methods below are the interface between TestCase.run() and TestResult. (I guess TestCase.debug too, but no one cares about it).
startTest(test)
Called when test commences running. Although not enforced, it's impolite to provide any results for test before calling this.
stopTest(test)
Called when test is completely finished. Although not enforced, it's impolite to provide any more results for test after calling this, unless you call startTest(test) again first.
addSuccess(test)
Called when test has been shown to be successful. The default implementation does nothing.
addError(test, err)
Called when test raises an unexpected error. err is a tuple such as you might get from sys.exc_info(). Calling this method for the first time must change the result of wasSuccessful().
addFailure(test, err)
Called when test has failed one of its assertions. err is a tuple such as you might get from sys.exc_info().
The above interface is tightly coupled to the implementation of TestCase.run(). In particular, if you wish to add more kinds of results to your testing framework ("skip" results are a fairly common addition), then you must change both TestCase.run() and the TestResult interface.

If you do something like that, I recommend making sure that your modified TestCase can handle TestResult objects that do not provide the extensions to the interface that you need. One common way of doing this is to have the TestCase fall back to the primitive result types, e.g. "skip" might become "success" for a TestResult that doesn't know what skipping means.
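
A minimal sketch of that fallback, assuming a current Python 3 unittest. The helper report_skip and the class OldStyleResult are hypothetical names made up for this example; only addSkip, addSuccess and the other TestResult methods come from the interface described above.

```python
import unittest


def report_skip(result, test, reason):
    """Report a skip, degrading to success if the result has no addSkip."""
    add_skip = getattr(result, "addSkip", None)
    if add_skip is not None:
        add_skip(test, reason)
    else:
        # The result only knows the primitive outcomes, so fall back.
        result.addSuccess(test)


class OldStyleResult(object):
    """Hypothetical result object that predates the skip extension."""

    def __init__(self):
        self.successes = []

    def startTest(self, test):
        pass

    def stopTest(self, test):
        pass

    def addSuccess(self, test):
        self.successes.append(test)

    def addError(self, test, err):
        pass

    def addFailure(self, test, err):
        pass


class Dummy(unittest.TestCase):
    def runTest(self):
        pass


test = Dummy()

old = OldStyleResult()
report_skip(old, test, "no network")   # recorded as a success

new = unittest.TestResult()
report_skip(new, test, "no network")   # recorded as a skip
```

Probing with getattr keeps the test case usable with both kinds of result, at the cost of quietly reporting a skip as a pass on the older kind.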

Importantly, the interface between TestCase and TestResult has been fattened in Python 2.7.
addSkip(test, reason)
Called when test is skipped. reason is a string explaining why the test was skipped.
addExpectedFailure(test, err)
Called when test failed in a way that was expected. err is a tuple such as the one returned by sys.exc_info().
addUnexpectedSuccess(test)
Called when test was expected to fail, but didn't.
The following interface is a way of learning about test results after they have happened, the result querying interface, and is part of the contract between the test runner and the TestResult.
wasSuccessful()
If there have been no errors and no failures, return True. Return False otherwise.
testsRun
An integer that is the number of tests that have been run.
errors
A list of tuples of (test, error_message) for all of the tests with unexpected errors, where test is an ITestCase and error_message is a string suitable for display to humans, generally containing a traceback.
failures
A list of tuples of (test, error_message) for all of the failing tests, where test is an ITestCase and error_message is a string suitable for display to humans, generally containing a traceback.
And of course, Python 2.7 fattens this interface again to have the following:
skipped
A list of tuples of (test, reason) for all of the skipped tests, where test is an ITestCase and reason is a string explaining why the test was skipped.
expectedFailures
A list of tuples of (test, error_message) for all of the tests that were expected to fail and failed in the manner they were expected to, where test is an ITestCase and error_message is a string suitable for display to humans, generally containing a traceback.
unexpectedSuccesses
A list of all of the tests that unexpectedly succeeded. Members of the list are ITestCases.
In Python 2.7, TestResult also extended its interface to the test runner beyond simple result querying, allowing the test runner itself to send two very important events to the TestResult. Behold the runner event handling interface:
startTestRun()
Called before any tests have been run. It is impolite to provide any test results before calling this.
stopTestRun()
Called after all the tests have finished running. It is impolite to provide any test results after calling this. A TestResult object is generally not expected to handle any events at all after this method has been called.
Some test runners rely on TestResults to use those events to display the results to the user. These runners frequently do not use the result querying part of the interface.
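
A sketch of that arrangement, assuming a current Python 3 unittest: the runner only sends the two runner events, and the result takes care of its own reporting. SelfReportingResult, run_tests and Pair are invented names, and collecting lines in a list stands in for writing to a real output stream.

```python
import unittest


class SelfReportingResult(unittest.TestResult):
    """Collects its own report lines instead of relying on the runner."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def startTestRun(self):
        super().startTestRun()
        self.lines.append("run started")

    def addSuccess(self, test):
        super().addSuccess(test)
        self.lines.append("ok " + test.id().split(".")[-1])

    def stopTestRun(self):
        super().stopTestRun()
        self.lines.append("run finished: %d tests" % self.testsRun)


def run_tests(test, result):
    """A minimal, hypothetical runner: it never queries the result."""
    result.startTestRun()
    try:
        test.run(result)
    finally:
        result.stopTestRun()
    return result


class Pair(unittest.TestCase):
    def test_one(self):
        pass

    def test_two(self):
        pass


result = run_tests(
    unittest.TestSuite([Pair("test_one"), Pair("test_two")]),
    SelfReportingResult(),
)
```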

There is one more interface that TestResult implements: the execution control interface:
stop()
Signal that the execution of further tests should stop now. Sets shouldStop to True.
shouldStop
If True, then test execution should stop. TestSuite.run() should monitor this value and stop execution if ever it is True.
This interface is mostly used as a way of handling KeyboardInterrupts cleanly.

Summary

If you want your TestResult object to work with standard Python TestCase objects, or any TestCase objects that try to stick close to the standard, then you must provide the test event handling interface described above. If you are writing your own test framework or test runner, you care about this, because you want to run everyone's unit tests.

If you want your TestResult object to work with the standard Python test runner before Python 2.7, then you must provide the result querying interface. If you are using the standard Python test runner, you care about this. For Trial or testtools, you must provide the runner event handling interface. For anything else, I'm afraid you are on your own.

Always provide the execution control interface.

Comments

In this documentation, I've been trying to describe the various interfaces without inserting too much of my own opinion about their design. However, I think some commentary might actually help to make things easier to understand.

By providing a querying interface for TestResult to be used by a test runner, the original designers of unittest practically insisted that responsibility for displaying the results of a test run be split between two different classes. The TestResult takes care of displaying incremental feedback from the running tests and the test runner takes care of displaying the summary. You can see evidence of this design in Python 2.6's unittest.py, where there's a hidden _TextTestResult subclass which has extra methods that are called only by a special TextTestRunner.

The addition of startTestRun() and stopTestRun() means that a TestResult object can now be fully in charge of displaying its results. As such, providing a query interface and exposing details like the list of test failures is somewhat vestigial.

I'm less happy with this post than the previous one. As such your critique is even more welcome.

Still to come: the interface for test authors and just what is a test runner anyway?

Update: Remove ambiguity in expectedFailures description (see comments). Thanks Aaron.

by jml (noreply@blogger.com) at August 02, 2010 01:15 PM

July 31, 2010

Paul Hummer

If I Had One Wish...

...it would be that we would stop propagating the lie that DRM is for your benefit. I just saw this on Lulu's ebook page.

DRM Bad!

If I had a second wish, I might wish for world peace or something. Eliminating DRM comes first (it might even be a pre-requisite).

July 31, 2010 11:45 PM

July 30, 2010

Paul Hummer

Source Package Recipes: Are They More Addictive Than Crack Cocaine?

Yes. Yes they are.

I'm crap at packaging. My mind has never really grokked all the packaging things. Also, I make packages, and then they quickly get out of date as I continue to work on the upstream part, and so then there's this big hill to get over when I actually need to package it again.

Enter source package recipes. It's what Aaron Bentley and I have been working on for the past 6 months. Basically, take two bzr branches, put them together using a recipe, and make a source package (and eventually a binary package) out of them.

My first recipe got built into a binary package two days ago, but I just noticed it.

Now I want to package everything.

July 30, 2010 04:28 PM

Dear Ubuntu Community - Thank You

I plug my new gadgets in. They work out of the box.

I want to install and play with a new piece of software. I don't have to search the net for that software.

I want to write new code. I'm up and running pretty quickly.

I want to buy new music and put it on my music player. I can certainly do that.

I want to contribute back to my OS. Kind people help me do that.

I just wanted to take a minute and say thank you to everyone that works on Ubuntu, from helping new users to writing code to testing code and filing bugs. You make me forget about my OS enough to get my work done. Thank you.

July 30, 2010 04:12 PM

Mark Shuttleworth

Tribalism is the enemy within

Tribalism is when one group of people start to think people from another group are “wrong by default”. It’s the great-granddaddy of racism and sexism. And the most dangerous kind of tribalism is completely invisible: it has nothing to do with someone’s “birth tribe” and everything to do with their affiliations: where they work, which sports team they support, which linux distribution they love.

There are a couple of hallmarks of tribal argument:

1. “The other guys have never done anything useful”. Well, let’s think about that. All of us wake up every day, with very similar ambitions and goals. I’ve travelled the world and I’ve never met a single company, or country, or church, where *everybody* there did *nothing* useful. So if you see someone saying “Microsoft is totally evil”, that’s a big red flag for tribal thinking. It’s just like someone saying “All black people are [name your prejudice]”. It’s offensive nonsense, and you would be advised to distance yourself from it, even if it feels like it would be fun to wave that pitchfork for a while.

2. “Evidence contrary to my views doesn’t count.” So, for example, when a woman makes it to the top of her game, “it’s because she slept her way there”. Offensive nonsense. And similarly, when you see someone saying “Canonical didn’t actually sponsor that work by that Canonical employee, that was done in their spare time”, you should realize that’s likely to be offensive nonsense too.

Let’s be clear: tribalism makes you stupid. Just like it would be stupid not to hire someone super-smart and qualified because they’re purple, or because they are female, it would be stupid to refuse to hear and credit someone with great work just because they happen to be associated with another tribe.

The very uncool thing about being a fanboy (or fangirl) of a project is that you’re openly declaring both a tribal affiliation and a willingness to reject the work of others just because they belong to a different tribe.

One of the key values we hold in the Ubuntu project is that we expect everyone associated with Ubuntu to treat people with respect. It’s part of our code of conduct – it’s probably the reason we *pioneered* the use of codes of conduct in open source. I and others who founded Ubuntu have seen how easily open source projects descend into nasty, horrible and unproductive flamewars when you don’t exercise strong leadership away from tribal thinking.

Now, bad things happen everywhere. They happen in Ubuntu – and because we have a huge community, they are perhaps more likely to happen there than anywhere else. If we want to avoid human nature’s worst consequences, we have to work actively against them. That’s why we have strong leadership structures, which hopefully put people who are proven NOT to be tribal in nature into positions of responsibility. It takes hard work and commitment, but I’m grateful for the incredible efforts of all the moderators and council members and leaders in LoCo teams across this huge and wonderful project, for the leadership they exercise in keeping us focused on doing really good work.

It’s hard, but sometimes we have to critique people who are associated with Ubuntu, because they have been tribal. Hell, sometimes I and others have to critique ME for small-minded and tribal thinking. When someone who calls herself “an Ubuntu fan” stands up and slates the work of another distro we quietly reach out to that person and point out that it’s not the Ubuntu way of doing things. We don’t spot them all, but it’s a consistent practice within the Ubuntu leadership team: our values are more important than winning or losing any given debate.

Do not be drawn into a tribal argument on Ubuntu’s behalf

Right now, for a number of reasons, there is a fever pitch of tribalism in plain sight in the free software world. It’s sad. It’s not constructive. It’s ultimately going to be embarrassing for the people involved, because the Internet doesn’t forget. It’s certainly not helping us lift free software to the forefront of public expectations of what software can be.

I would like to say this to everyone who feels associated with Ubuntu: hold fast to what you know to be true. You know your values. You know how hard you work. You know what an incredible difference your work has made. You know that you do it for a complex mix of love and money, some more the former, others more the latter, but fundamentally you are all part of Ubuntu because you think it’s the most profound and best way to spend your time. Be proud of that.

There is no need to get into a playground squabble about your values, your ethics, your capabilities or your contribution. If you can do better, figure out how to do that, but do it because you are inspired by what makes Ubuntu wonderful: free software, delivered freely, in a way that demonstrates real care for the end user. Don’t do it because you feel intimidated or threatened or belittled.

The Gregs are entitled to their opinions, and folks like Jono and Dylan have set an excellent example in how to rebut and move beyond them.

I’ve been lucky to be part of many amazing things in life. Ubuntu is, far and away, the best of them. We can be proud of the way we are providing leadership: on how communities can be a central part of open source companies, on how communities can be organised and conduct themselves, on how the economics of free software can benefit more than just the winning distribution, on how a properly designed user experience combined with free software can beat the best proprietary interfaces any day. But remember: we do all of those things because we believe in them, not because we want to prove anybody else wrong.

by mark at July 30, 2010 12:32 PM

July 29, 2010

Jonathan Lange

unittest API, part 1

It's a little-known fact, but unittest actually has an API.

This isn't the API that you deal with when you write tests, but rather an API that unittest itself uses when running tests. You could think of it as two interfaces: one for test frameworks and one for test authors. Both APIs are real, but both are poorly documented and often misunderstood or abused.

TestCase

An instance of TestCase represents a single test. What you think of as a single test is up to you, but most of the time it's a unit test.

A TestCase object must provide the following methods.

This first list of methods can be thought of as a single interface, which these blog posts will call ITest given the lack of any better name.
countTestCases()
A method that returns the number of test cases this represents. It should always return 1.
run(result=None)
Calling this method actually runs the test. result is a TestResult object. run must call result.startTest(self) when it commences running the test and result.stopTest(self) when it is finished. Between these calls it must call a method on result to signal the result of the test. run must never raise an exception, and its return value is ignored. If result is not provided, the TestCase is obliged to make one.
__call__(result)
Identical to run(result), provided for backwards compatibility.
debug()
Calling this method runs the test without collecting its results. It may raise exceptions. This method is rarely called by test frameworks.

The following methods are specific to individual test case objects. We call this interface ITestCase.

id()
Should return a string that uniquely identifies the test. For Python tests, the fully-qualified Python name works well. The uniqueness of the id is not enforced.
shortDescription()
Should return a string that describes the test. Many test frameworks use this value to display test results.
__str__
Should return a string that describes the test. Frequently the same as either shortDescription() or id(). Many test frameworks use this value to display test results.
There is also a second interface, one that matters to code that subclasses TestCase. We'll deal with that in a later post.
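
The ITest and ITestCase methods listed above can be sketched as a duck-typed test that does not subclass TestCase at all. This assumes a current Python 3 unittest; FunctionTest and the two check functions are hypothetical. Note one extra attribute, failureException, which the standard library's TestResult reads when it formats a traceback.

```python
import sys
import unittest


class FunctionTest(object):
    """Hypothetical test object wrapping a plain function."""

    # Read by the standard TestResult when formatting tracebacks.
    failureException = AssertionError

    def __init__(self, func):
        self._func = func

    # ITest
    def countTestCases(self):
        return 1

    def run(self, result=None):
        if result is None:
            result = unittest.TestResult()
        result.startTest(self)
        try:
            try:
                self._func()
            except AssertionError:
                result.addFailure(self, sys.exc_info())
            except Exception:
                result.addError(self, sys.exc_info())
            else:
                result.addSuccess(self)
        finally:
            result.stopTest(self)

    def __call__(self, result=None):
        return self.run(result)

    def debug(self):
        # No result collection: exceptions propagate to the caller.
        self._func()

    # ITestCase
    def id(self):
        return "%s.%s" % (self._func.__module__, self._func.__name__)

    def shortDescription(self):
        doc = self._func.__doc__
        return doc.strip().splitlines()[0] if doc else None

    def __str__(self):
        return self.id()


def check_addition():
    """Addition works."""
    assert 1 + 1 == 2


def check_broken():
    assert 1 + 1 == 3


result = unittest.TestResult()
tests = [FunctionTest(check_addition), FunctionTest(check_broken)]
for test in tests:
    test.run(result)
```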

TestSuite

A TestSuite represents nothing more or less than a bunch of tests.

A TestSuite must provide the ITest interface described above, with the differences that you would expect from something that represents many tests: countTestCases returns the number of tests in the suite; run runs many tests and thus calls result.startTest and kin many times over; debug is the same and can explode anywhere.

One difference is that TestSuite.run must stop running tests as soon as it detects that result.shouldStop is true.

In addition, TestSuite implements the following interface, which I'm giving the completely arbitrary non-existent name of ITestSuite.
addTest(test)
Takes an ITest and adds it to the suite.
addTests(tests)
Takes an iterable of ITests and adds them to the suite. Normally equivalent to [suite.addTest(test) for test in tests].
__iter__
All test suites must be iterable. Iterating over a test suite yields ITests. These may differ from the ITests provided to addTest and addTests.
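
A bare-bones sketch of an object providing both ITest and ITestSuite, assuming a current Python 3 unittest. SimpleSuite and Arithmetic are invented for the example; unlike the standard TestSuite, this sketch does nothing about class or module fixtures.

```python
import unittest


class SimpleSuite(object):
    """Hypothetical minimal suite: an ordered bunch of ITests."""

    def __init__(self, tests=()):
        self._tests = []
        self.addTests(tests)

    # ITestSuite
    def addTest(self, test):
        self._tests.append(test)

    def addTests(self, tests):
        for test in tests:
            self.addTest(test)

    def __iter__(self):
        return iter(self._tests)

    # ITest
    def countTestCases(self):
        return sum(test.countTestCases() for test in self)

    def run(self, result):
        for test in self:
            # Honour the execution control interface.
            if result.shouldStop:
                break
            test.run(result)
        return result

    def __call__(self, result):
        return self.run(result)

    def debug(self):
        for test in self:
            test.debug()


class Arithmetic(unittest.TestCase):
    def test_add(self):
        self.assertEqual(1 + 1, 2)

    def test_mul(self):
        self.assertEqual(2 * 3, 6)


# Suites are ITests too, so they nest.
inner = SimpleSuite([Arithmetic("test_add")])
outer = SimpleSuite([inner])
outer.addTest(Arithmetic("test_mul"))

result = unittest.TestResult()
outer.run(result)
```
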
In later posts, I hope to document TestResult, the subclassing interface of TestCase and tell you exactly what I think about test loaders, test runners and the like.

I'm blogging this partly because I don't know where else to write this up, but mostly because I need your help to make sure that I'm being clear and correct. Please comment with questions and corrections, and let me know if you find this at all helpful.

by jml (noreply@blogger.com) at July 29, 2010 06:56 PM

July 27, 2010

Elliot Murphy

getting things done requires a pipeline of projects

Even though I am a manager and practicing programmer/wannabe sysadmin, having a productive day like today requires proactively building and maintaining a pipeline of projects and network of experts much like a salesperson maintains a pipeline of deals and network of decision makers.

Today I emailed people who I have not spoken to in months or years, followed up on contracts, advised a colleague at work on how to solve some relationship issues on their project, attended a weekly project review call, spent a lot of time thinking about analytics and metrics, packaged and uploaded CouchDB 1.0 to Ubuntu, committed fixes to Debian SVN for CouchDB package, wrote/code reviewed/revised/deployed/tested a fix in production for ubuntuone.com error pages based on a mailing list discussion this morning, did a tech analysis/recommendation of how to implement a new feature, got new tires, walked outside and admired the large and inspiring moon with my daughter, reproduced a frustrating issue that has been stopping us from upgrading django on a project at work and blogged about it asking for help, ordered a new sticker for my laptop, consulted on a video installation at the church where I volunteer, fended off a Panopticon sales rep and got a useful recommendation in return, and deleted lots of emails and unread blog entries. I’m writing all this down because sometimes I don’t feel like I get enough done in a day, and writing it down gives me hope that things are actually getting done. None of this would have gotten accomplished today without the zillion invisible bits of work that went into that pipeline in the last 6 months.

I still need a haircut.

by Elliot Murphy at July 27, 2010 04:10 AM

Lazy test loading to deal with conflicting django settings

At work I have a bunch of (ok, 3) different Django projects in the same big code tree. Yes, I know we should split them up, thanks for pointing that out. Anyway, we are running Python unit tests using the trial test runner from Twisted, because it’s very nice and we also have some Twisted servers in this same code tree.

I have a problem with Django settings. There are some conflicting settings in the settings file used by different Django servers. The solution seems easy – run tests for each Django server in a separate subprocess. The excellent subunit library should do just the trick, it even has IsolatedTestSuite and IsolatedTestCase classes that take care of forking and running in a separate process.

Except this doesn’t work. Because when python modules are imported for test discovery, they also indirectly end up importing django.settings, and when the IsolatedTestSuite forks to run tests in a separate subprocess, that subprocess inherits the already polluted python environment that has the (sometimes wrong) django.settings imported already.
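
One possible direction, sketched here under heavy assumptions and not tested against trial or subunit: hand the runner a placeholder that knows only the module name and imports the module when it is run, so the import would happen after any fork. LazyModuleSuite and the throwaway module lazy_demo_tests are hypothetical names created for this example.

```python
import importlib
import os
import sys
import tempfile
import textwrap
import unittest


class LazyModuleSuite(object):
    """Hypothetical ITest that imports its test module only when run."""

    def __init__(self, module_name, loader=None):
        self.module_name = module_name
        self.loader = loader or unittest.defaultTestLoader

    def _load(self):
        module = importlib.import_module(self.module_name)
        return self.loader.loadTestsFromModule(module)

    def countTestCases(self):
        # Unknown until loaded; counting would force the import.
        return 0

    def run(self, result):
        return self._load().run(result)

    __call__ = run

    def debug(self):
        self._load().debug()


# Write a throwaway test module so the sketch is self-contained.
tmpdir = tempfile.mkdtemp()
with open(os.path.join(tmpdir, "lazy_demo_tests.py"), "w") as handle:
    handle.write(textwrap.dedent('''
        import unittest

        class DemoTest(unittest.TestCase):
            def test_ok(self):
                self.assertTrue(True)
    '''))
sys.path.insert(0, tmpdir)
importlib.invalidate_caches()

suite = LazyModuleSuite("lazy_demo_tests")
imported_before = "lazy_demo_tests" in sys.modules

result = unittest.TestResult()
suite.run(result)
imported_after = "lazy_demo_tests" in sys.modules
```

The catch is that discovery can no longer inspect the tests up front, so anything that needs a test count or test ids before running would have to get them some other way.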

I am convinced that this must be solvable, but have been banging my head against it for a while and don’t understand unittest discovery well enough to solve it. I’ve created a self-contained little example that demonstrates the problem in isolation here: https://code.edge.launchpad.net/~statik/+junk/subunit-demo/

I will gladly endure your taunts if you teach me a solution.

by Elliot Murphy at July 27, 2010 03:42 AM

Jonathan Lange

Python 3

I would be much more sympathetic to the whole Python 3 endeavour if they had made a serious effort to keep the major 2.x releases mutually compatible.

by jml (noreply@blogger.com) at July 27, 2010 01:40 AM