
Python owns us

Saturday, August 03, 2002

 

Python list.sort() now stable and faster

In case you missed it, Tim Peters of the core Python team has been re-implementing the sort() method for lists. The whole story is quite lengthy and the details of the new algorithm are rather hairy, so here is the outline; it is a crude simplification of the issues, so refer to the links if you really want to know what was going on. The main reasons for re-implementing were that there are more advanced, i.e. faster, algorithms for sorting data that has some pre-existing order in it, and that list.sort() isn't stable, meaning that items which compare equal may or may not keep their original relative order after sorting.

Stability can be achieved with the so-called decorate-sort-undecorate transform. This Python Cookbook recipe by A. Martelli shows how to do it. It introduces some extra memory burden, though.
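For the curious, here is a minimal sketch of the decorate-sort-undecorate idea (not Martelli's exact recipe, just the general shape): wrap each item in a tuple whose first field is the sort key and whose second field is the original position, so that equal keys fall back to comparing positions and the original order survives.

def stable_sort_by(items, key_func):
    # Decorate: prepend the key and the original position to each item.
    decorated = [(key_func(item), i, item) for i, item in enumerate(items)]
    # Sort: tuples compare field by field, so equal keys fall back to the
    # original position -- that is what makes the result stable.  The items
    # themselves are never compared, since the position already breaks ties.
    decorated.sort()
    # Undecorate: strip the key and position off again.
    return [item for (key, i, item) in decorated]

# Example: sort words by length; ties keep their original order.
words = ['pear', 'fig', 'plum', 'kiwi', 'date']
print(stable_sort_by(words, len))   # ['fig', 'pear', 'plum', 'kiwi', 'date']

The decorated list is where the extra memory burden comes from.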

It all started with Tim wondering out loud whether to re-implement list.sort(). A significant amount of research has been done on sorting since the previous implementation of Python's sort. The keyword is adaptivity: adaptive algorithms take advantage of whatever pre-existing order the data has, and real-world data tends not to be purely random. Tim then came up with an implementation of what he calls a non-recursive adaptive stable natural mergesort / binary insertion sort hybrid, something not usually taught in a basic computer science algorithms course <wink>, which will from now on be referred to as "timsort".
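The binary insertion sort half of the hybrid is the less exotic part: it is plain insertion sort, except that the insertion point is found with binary search, which keeps the comparison count down on the short runs it is used for. A rough Python sketch of that idea (nothing like the actual C code, and without the natural-run merging where the real cleverness lives):

import bisect

def binary_insertion_sort(a):
    # Grow a sorted prefix a[:i] one element at a time.
    for i in range(1, len(a)):
        x = a[i]
        # Rightmost insertion point within the sorted prefix; taking the
        # rightmost slot keeps equal elements in order, i.e. it is stable.
        pos = bisect.bisect_right(a, x, 0, i)
        # Shift a[pos:i] one slot to the right and drop x into the gap.
        a[pos + 1:i + 1] = a[pos:i]
        a[pos] = x

data = [5, 2, 9, 2, 7, 1]
binary_insertion_sort(data)
print(data)   # [1, 2, 2, 5, 7, 9]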

The initial implementation was fine in most respects, but there were two annoyances: the new sort introduced a 2*N-byte extra memory burden, and on one specific data profile timsort was significantly slower. That data consisted of many duplicates, and timsort was around four times slower on it than the current sort implementation.
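If you want a feel for that kind of data profile, something along these lines reproduces its shape (the exact benchmark data is in the python-dev threads; this is just a guess at what "many duplicates" means in practice):

import random, time

# Lots of items, very few distinct values.
N = 100000
data = [random.randrange(10) for i in range(N)]

t0 = time.time()
data.sort()
print("sorted %d items (%d distinct values) in %.3f s"
      % (N, len(set(data)), time.time() - t0))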

There were a couple of suggestions that the list object could grow a new method, something like stable_sort(), but Tim quickly replied that the current sort implementation is already one of Python's most complicated algorithms, and integrating another large mass of code would make it more difficult to maintain.

In between all this, the timbot dropped a reference to the original paper on samplesort [FrM70] (which I hope to find in our university's library on Monday). The largest array they tested it on was 50,000 items, which is such a tiny case nowadays that even in Python it is hard to time accurately.

Tim also implemented Edelkamp and Stiegeler's "Next-to-m" refinement of weak heapsort, but the first refinement of their algorithm proved to be so far from promising that he dropped it.

Eventually the tuning got to the point where timsort was at the same level as, or even faster than, the current sort on the troublesome sample data, so the speed issue practically went away. Also, an elf revealed to Tim that Perl is going for a stable mergesort, too. In the Perl source code there was a reference to a paper by Peter McIlroy [McI93], which proved to be essential but is mysteriously little known and rarely cited.

The remaining mystery was cross-platform performance. A bunch of python-dev'ers were quick to volunteer to test it (for example here, here and here; a better record of the tests is in the sf.net patch manager). It all proved to be very promising and Tim finally declared timsort to be basically done.

On a seemingly unrelated puzzle, Tim ran timsort on a python-dev archive and found a 32% speedup. McIlroy noted similar behavior with PostScript and C source files. Now, the total number of comparisons in these cases is less than log2(N!), which just can't happen if the input data is randomly ordered. This is a bit odd, since you wouldn't think there's significant pre-existing lexicographic order in a file like that, says uncle Timmy.
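For reference, log2(N!) is the information-theoretic lower bound on the number of comparisons any comparison-based sort needs on randomly ordered, distinct items. Here is a quick way to compute it if you want to hold it against a measured comparison count (a plain exact summation, no approximations):

import math

def comparison_lower_bound(n):
    # log2(n!) = log2(2) + log2(3) + ... + log2(n)
    return sum(math.log(k, 2) for k in range(2, n + 1))

print(comparison_lower_bound(1000))   # roughly 8529 comparisons for 1000 items
# A sort that uses clearly fewer comparisons than this on some input is
# evidence that the input was not randomly ordered to begin with.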

There was some fiddling with real-world data and the results were delightful.

So, on 31.7.2002, the new timsort was checked in and the patch was closed. A detailed description of the algorithm is in a file called listsort.txt in the CVS. The implementation is in the listobject.c file. Have fun figuring it out <wink>.

Oh, Timsort is now in Jython, too.

[FrM70] Frazer, W. D. and McKellar, A. C.
Samplesort: A Sampling Approach to Minimal Storage Tree Sorting.
JACM, Vol. 17, No. 3, July 1970.

[McI93] McIlroy, P.
Optimistic Sorting and Information Theoretic Complexity.
SODA (Fourth Annual ACM-SIAM Symposium on Discrete Algorithms),
pp. 467-474, Austin, Texas, 25-27 January 1993.

(Oh, this turned out to be a rather lengthy piece after all.)

Thursday, August 01, 2002

 

Number of programming language books released per year

I have put up an initial version of yet another plotting script. This one plots the number of books released per year (between 1982 and 2001) for a given programming language, according to Amazon. Have fun.

Wednesday, July 31, 2002

 

I am still alive

But the heat is melting my brains. This seems like a good excuse to nit-pick about diventomark's new look. I find the new look annoying on the computers I use. Sure, the computer I am using probably sucks and its administrator even more so, but I have no desire to fine-tune them myself, because they are not my computers. OK, the easiest solution is just not to read it; no harm done.

On a quasi-related note, I find his pyamazon module quite satisfying and inspirational. Once I manage to wrap things up, I will put up some further plotting madness.

A minor setback was that the following queries fail because the resulting XML documents do not validate according to xml.sax:

amazon.searchByKeyword('C Programming', page=20)
amazon.searchByKeyword('C Programming', page=44)
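Until I figure out what exactly is wrong with those documents, a workaround along these lines seems plausible (just a sketch: it assumes searchByKeyword returns a list of results and lets the xml.sax parse error propagate, which is what appears to happen here):

import xml.sax
import amazon   # pyamazon, as in the queries above

def search_pages(keyword, pages):
    # Collect results page by page, skipping pages whose XML will not parse.
    results = []
    for page in pages:
        try:
            results.extend(amazon.searchByKeyword(keyword, page=page))
        except xml.sax.SAXParseException:
            print("page %d did not parse, skipping" % page)
    return results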