Far More Than Everything You've Ever Wanted to Know About Sorting


In comp.lang.perl, Randal Schwartz <merlyn@stonehenge.com> obfuscates:

 >>>> "Hugo" == Hugo Andrade Cartaxeiro <hac@nexus.inesc.pt> writes:
 :
 :Hugo> print $str;
 :Hugo> eir      11   9   2    6    3    1     1  81%  63%    13
 :Hugo> oos      10   6   4    3    3    0     4  60%  70%    25
 :Hugo> hrh      10   6   4    5    1    2     2  60%  70%    15
 :Hugo> spp      10   6   4    3    3    1     3  60%  60%    14
 :
 :Hugo> and I like to sort it with the last field as the order key. I know
 :
 :Hugo> perl has some features to do it, but I can't make 'em work properly.
 :
 :Well, speaking Perl with a Lisp.... :-)

We all recall that Perl est un esprit LISP dans un corps C, n'est-ce pas? :-) (``corps C'' sounds like ``corset'' in French.) [Philippe Verdret <phd@eurolang.fr>]

CODE 0.0

:require 5; # if you don't have perl 5 by now, why NOT? :-) :$str = : join "\n", : map { $_->[0] } : sort { $a->[1] <=> $b->[1] } : map { [$_, (split)[-1]] } : split /\n/, : $str; : :Hugo> Reply by mail (if any). : :(oops :-)

Oh for cryin' out loud, Randal! You expect a new perl programmer to make heads or tails of that>? :-) You're postings JAPHs for solutions, which isn't going to help a lot. You'll probably manage to scare these poor people away from the language forever? :-)

BTW, you have a bug.

Couldn't you at least indent it properly? :-) This might help folks make more sense of it. I'm going to use the

map EXPR,LIST

rather and then

map BLOCK LIST

of map here so that it looks more like a conventional subroutine call (although I can't do that with sort):

CODE 0.1

sort( { $a->[1] => $b->[1] } map( [ $_, (split(' ', $_))[-1] ], split(/\n/, $str) ) ) ) );

Hm... ok, not much better. Let's do this slowly and work our way through it. First of all, let's look at your string again.

eir 11 9 2 6 3 1 1 81% 63% 13 oos 10 6 4 3 3 0 4 60% 70% 25 hrh 10 6 4 5 1 2 2 60% 70% 15 spp 10 6 4 3 3 1 3 60% 60% 14

It looks like it's got a bunch of lines in it, so the first thing we're going to do is break it up into a list of lines.

@lines = split (/\n/, $str);

You need to do this because each the perl built-in sort() function expects a list and returns a list. It works like this in its most simple incarnation.

@newlist = sort @oldlist;

Well, that's all well and good, but it's going to do the default sort, meaning it'll sort it by ASCII. Imagine a function like this:

CODE 1.1

sub by_last_num_field { @alist = split(/\s+/, $a); @blist = split(/\s+/, $b); if ( $alist[$#alist] > $blist[$#blist] ) { return 1; } elsif ( $blist[$#blist] > $alist[$#alist] ) { return -1 } else { return 0; } }

You could then say

@newlist = sort by_last_num_field @oldlist; And it would work out for you.

It turns out that the $a and $b are each of the two values to be compared in the original list. They're not going to passed in the way normal arguments are, which is $_[0], $_[1], etc.

That's work out ok, but it's pretty complicated looking. It turns out that for just a matter, perl has a couple of operators to help you: <=> for numbers and cmp for strings.

While normally we give split a regular expression as its first argument, there's one case where you might want to give it a string: if the first argument is just a single space, then it will automatically discard leading null fields, which would otherwise be preserved. (Trailing null fields are discarded in any case (unless you give split() a third argument).)

So you could just do this:

CODE 1.2

sub by_last_num_field { @alist = split(' ',$a); @blist = split(' ',$b); return $alist[$#alist] <=> $blist[$#blist]; }

Now you could also have written that function ``in-lined'', like so. I'm going to leave out the word return this time:

CODE 1.3

@newlist = sort { @alist = split(' ', $a); @blist = split(' ', $b); $alist[$#alist] <=> $blist[$#blist]; } @oldlist;

Ok so far? We can make use of perl5's ability to do negative subscripting to get the last element:

CODE 1.4

@newlines = sort { @alist = split(' ', $a); @blist = split(' ', $b); $alist[-1] <=> $blist[-1]; } @oldlines;

And in fact, we can do this without the temporary lists:

CODE 1.5

@newlines = sort { (split(' ', $a))[-1] <=> (split(' ', $b))[-1]; } @oldlines;

But no matter how you write it, that's a wee tad expensive for large data sets. Since Perl's sort is a hook into C's qsort (lookup qsort(3) in the on-line Unix manual, if you're daring), that makes it N-logN sort. You're going to be doing 2 splits for each of those. With your data set of four, this isn't two bad: you'll average only 11 splits. With ten lines though, you're already up to 46. With a mere one hundred lines, 921 splits, and with a thousand lines, we're talking 13,816 splits on the average! That's way too many.

So let's arrange to do just N splits. We'll have to pre-process them:

CODE 2.1

@last_field = (): foreach $line (@oldlines) { $last_elt = (split(' ', $line))[-1]; push (@last_field, $last_elt); }

Now for a given line $n in @oldlines , its last field can be found in the corresponding element of the @last_field array, that is, $last_field[$n] .

You could also just use the return value from split directly, without using a temporary, like this:

CODE 2.2

@last_field = (): foreach $line (@oldlines) { push (@last_field, (split(' ', $line))[-1]); }

You'll find that many perl programs are used to letting the $_ variable be the default to quite a few different operations. This is unsettling to the newcomer, but it's really not that bad. Amongst other things, $_ is the default iterator variable on a foreach loop, the default string to split on in a split() function, as well as the default variable to search or replace in a match or substitution. (You'll find that it's also the implicit iterator variables in the map() and grep() built-ins, which are themselves something like short-cut ways of writing foreach loops.)

CODE 2.3

@last_field = (): foreach (@oldlines) { push(@last_field, (split(' '))[-1]); }

In perl5, we can write that without so many parentheses if we'd like. This time I'm also going to spell "foreach" as "for". It actually doesn't matter to perl, but it might to the reader.

CODE 2.4

@last_field = (): for (@oldlines) { push @last_field, (split ' ')[-1]; }

This entire process of reducing things to smaller and smaller amounts of code to do the equivalent thing is an interesting exercise, but perilous. While between 1.1 and 1.5 and between 2.1 and 2.4, we actually did improve the efficiency of the operation ever so slightly, we've also managed to make it harder to debug and harder to maintain. It's nice to have intermediary values sometimes in order to print them out part way through the calculation, whether in the debugger or with normal print statements. It's also harder for someone to come by later and figure out what you're doing. If your aim is job security, then this might work out in your favor, but in most other cases, it probably won't.

Ok, how do we couple the stuff we did in code set 2 with what we did in code set 1? Remembering that any given element in @last_field corresponds to the last field (the thing we want to sort on) in the @oldlines list. So we're going to do something a bit strange and sort not real data but rather just the indices. That is, we'll rearrange indices (like 0,1,2,3,4) to correspond to the properly sorted order (like 3,2,1,4,0). This is probably the hardest concept to grasp in this whole tutorial.

CODE 3.1

@orig_indices = 0 .. $#oldlines; @sorted_indices = sort { $last_field[$a] <=> $last_field[$b] } @orig_indices; @newlist = (); foreach $idx (@sorted_indices) { push(@newlines, $oldlines[$idx]); }

Now when we're done, we've got everything into its proper order. Occasionally, advanced perl programmers make use of what we call array slices. If you say

CODE: 3.2.1:

@old_idx = (1,2,3); @new_idx = (3,1,2); @x[@new_idx] = @y[@old_idx];

It's the same as though you'd said:

CODE 3.2.2

@x[3,1,2] = @y[1,2,3];

Which is really nothing more than simply:

CODE 3.2.3

$x[3] = $y[1]; $x[1] = $y[2]; $x[2] = $y[3];

Except that it's shorter to write. Notice how when you're doing slice operations you have to subscript your list with an @ not a $? That's because the $ or @ sign actually indicates what it is that you're getting back, not what you're looking at. In this case, you're assigning a (sub-)list to a (sub-)list, so you need the @ signs.

This means we can re-write 3.1 as

CODE 3.3

@orig_indices = 0 .. $#oldlines; @sorted_indices = sort { $last_field[$a] <=> $last_field[$b] } @orig_indices; @newlines = @oldlines[@sorted_indices];

Now, since sort will return the indices, we can avoid some temporaries and write that as:

CODE 3.4

@newlines = @oldlines[ sort { $last_field[$a] <=> $last_field[$b] } 0 .. $#oldlines ];

If we'd used shorter variable names, we could have then written it all on one line:

CODE 3.5

@new = @old[ sort { $f[$a] <=> $f[$b] } 0 .. $#old ];

Now what? Well, we have to get back the user's orginal data.

CODE 4.1

$str = join ("\n", @newlines);

But you'll find that you're shy a final newline (that's Randal's bug).

You could solve this by passing another null argument to join() ,

CODE 4.2

$str = join ("\n", @newlines, '');

or by concatenating a newline to the result:

CASE 4.3

$str = join ("\n", @newlines) . "\n";

Combining all these, we have final solution that looks like this:

CODE 5.1

@old = split (/\n/, $str); @f = (): for (@old) { push @f, (split ' ')[-1]; } @new = @old[ sort { $f[$a] <=> $f[$b] } 0 .. $#old ]; $str = join ("\n", @new) . "\n";

We can merge a few of these and have something shorter.

CODE 5.2

@old = @f = (): for (split (/\n/, $str)) { push @old, $_; push @f, (split ' ')[-1]; } $str = join ("\n", @old[sort{ $f[$a] <=> $f[$b] } 0..$#old ]) . "\n";

It's true tha this is verging on the inscrutable, but we're trying to get closer to Randal's obfuscated perl entry he presented way at the beginning of this.

Nonetheless, it's still quite a bit different from Randal's. One thing he's done is make use of the map() function. This is a perl5 feature that uses its first expression argument as code to execute repeatedly against each of its following list arguments, setting $_ to each element, and returning the result of the expression to create a new output list.

CODE 5.2.1

@single = 1..10; @treble = map $_ * 3, @single;

You may also write this expression as a block, and omit the comma, which is what Randal did:

CODE 5.2.2

@single = 1..10; @treble = map { $_ * 3 } @single;

Or even omit the temporary:

CODE 5.2.3

@treble = map { $_ * 3 } 1..10;

You can use the map() function to shorten up 5.2 a bit by pulling out the for loop:

CODE 5.3

@old = split (/\n/, $str); @f = map { (split ' ')[-1] } @old; $str = join ("\n", @old[sort{ $f[$a] <=> $f[$b] } 0..$#old ]) . "\n";

You could even make use of the result of the assignment statement as the argument to map and shrink it sill more:

CODE 5.4

@f = map { (split ' ')[-1] } @old = split (/\n/, $str); $str = join ("\n", @old[sort{ $f[$a] <=> $f[$b] } 0..$#old ]) . "\n";

Now, Randal didn't want to use any temporary variables at all. Let's look at his jumbled code again:

CODE 0.0

$str = join "\n", map { $_->[0] } sort { $a->[1] <=> $b->[1] } map { [$_, (split)[-1]] } split /\n/, $str;

He's been kinda nasty in that there are no parens or indentation. I offered this version which adds them:

CODE 0.1

$str = join("\n", map($_->[0], sort( {$a->[1] <=> $b->[1] } map( [$_, (split(' ', $_))[-1] ], split(/\n/, $str) ) ) ) );

But you could retain Randal's block form as well:

CODE 0.2:

$str = join("\n", map( { $_->[0] } sort( {$a->[1] <=> $b->[1] } map( { [$_, (split(' ', $_))[-1] ] } split(/\n/, $str) ) ) ) );

While in some ways code 0.2 resembles code 5.4, in many others it doesn't. What are all those crazy arrows and square brackets doing there? Well, that's a really interesting question.

Let's take this thing inside out. Here's the innermost portion of it:

CODE 6.1

map { [$_, (split(' ', $_))[-1] ] } split(/\n/, $str)

What's going on is that he's using the return from split to be the arguments to map() over. Ok, but what's that map() code block doing???

CODE 6.2

[ $_, (split(' ', $_))[-1] ]

What 6.2 does happens to do is to create a reference to a list of two elements. It's as though you said this

CODE 6.2.1

my @x = ( $_, (split(' ', $_))[-1] ); return \@x;

However, the list reference he's creating by using the [ .... ] notation isn't a named list at all. If you wrote the inner loop as a real honest-to-goodness loop, it might look like this: (remember that 6.3 is just the long form of 6.1)

CODE 6.3

@listrefs = (); for (split(/\n/, $str)) { $newref = [ $_, (split(' ', $_))[-1] ]; push (@listrefs, $newref); }

And then you'd have all your references in @listrefs. This means that for a given line within $str (call it line number $n ), $listrefs [$n]->[0] is going to be the whole line itself (without its newline), while $listrefs[$n]->[1] is going to be its last field. We could actually write those without the pointer arrows, because adjacent brackets may omit them, as in $listrefs[$n][0] , but we're going to need them for what later, so you should get used to seeing them.

Well, what's the next level do? It's a sort. Consider this:

CODE 7.1

@newrefs = sort { $a->[1] <=> $b->[1] } @listrefs;

Since $a and $b will be each element of @listrefs, they will each be references to lists (kinda like pointers to lists). Remember that each of these refers to a list of two elements, whose zeroth element $a->[0] is the entire data, but whose first element $a->[1] is the sort key. So we'll compare the sort keys.

Put we don't want the sort keys. We want the original data. You could write a loop to do it:

CODE 7.2

@data = (): foreach $ref (@newrefs) { push(@data, $ref->[0]); }

Or you could be like Randal and avoid temporaries by using a map() instead of a loop:

CODE 7.3

@data = map { $_->[0] } @newrefs;

Combining up various stages, we get this:

CODE 7.4

@lines = split( /\n/, $str); @listrefs = map { [$_, (split(' ', $_))[-1] ] } @lines; @newrefs = sort { $a->[1] <=> $b->[1] } @listrefs; @data = map { $_->[0] } @newrefs; $str = join("\n", @data) . "\n";

It may not look like it, but this is very, very close to what Randal wrote back in 0.0; he just didn't keep around his temporary results in names variables. It's quite likely that he even developed the 0.0 code starting from something that looked more like 7.4. If you don't, you'll have an exceedingly difficult time debugging it.

But you know something? You don't have to be all so super-clever to do this. There's a much more straightforward solution:

CODE 8.1

$tmpfile = "/tmp/sortage.$$"; open (TMP, ">$tmpfile"); print TMP $str; close(TMP); $str = `sort +10n $tmpfile`; unlink $tmpfile;

See? We just write it to a file, then let the system sort program do the work for you. Isn't that much easier to understand? If you can easily express your sort in something that the system sort program will understand, it usually makes more sense to use it. It'll run faster and be easier to write as well.

Of course, we've made committed some minor errors here by omitting error checking. This would be better:

CODE 8.2

$tmpfile = "/tmp/sortage.$$"; open (TMP, ">$tmpfile") || die "can't open $tmpfile for writing: $!"; (print TMP $str) || die "can't write to $tmpfile: $!"; close(TMP) || die "can't write to $tmpfile: $!"; $str = `sort +10n $tmpfile`; if ($?) { die "sort exited with a $?" } unlink($tmpfile) || die "can't unlink $tmpfile: $!";

That line to print() looks funny, eh? And what about the parens on the unlink? You probably thought that you could have written both of those as

CODE 8.2.1

print TMP $str || die "can't write to $tmpfile: $!"; unlink $tmpfile || die "can't unlink $tmpfile: $!";

But in fact, you couldn't. The || binds more tightly than you want. It would have been as though you'd written:

CODE 8.2.2

print TMP ($str || die "can't write to $tmpfile: $!"); unlink ($tmpfile || die "can't unlink $tmpfile: $!");

Which certainly isn't good. That's why in perl5 we have or, which is a very low-precedence operator. We probably would write it like this:

CODE 8.3

$tmpfile = "/tmp/sortage.$$"; open TMP, ">$tmpfile" or die "can't open $tmpfile for writing: $!"; print TMP $str or die "can't write to $tmpfile: $!"; close TMP or die "can't close $tmpfile: $!"; $str = `sort +10n $tmpfile`; if ($?) { die "sort exited with a $?" } unlink $tmpfile or die "can't unlink $tmpfile: $!";

Another trick sometimes used is to use to the filename itself as an indirect file handle.

CODE 8.4

$tmpfile = "/tmp/sortage.$$"; open $tmpfile, ">$tmpfile" or die "can't open $tmpfile for writing: $!"; print $tmpfile $str or die "can't write to $tmpfile: $!"; close $tmpfile or die "can't close $tmpfile: $!"; $str = `sort +10n $tmpfile`; if ($?) { die "sort exited with a $?" } unlink $tmpfile or die "can't unlink $tmpfile: $!";

Another trick sometimes used is to use to the filename itself as an indirect file handle. This has a greater advantage when you are reading from the file, and issuing diagnostics using warn() or die(), which include the line (or chunk) position and filehandle anme of the last file read from. This way you'd also get the

CODE 8.5

$tmpfile = "/tmp/sortage.$$"; open $tmpfile, ">$tmpfile" or die "can't open $tmpfile for writing: $!"; print $tmpfile $str or die "can't write to $tmpfile: $!"; close $tmpfile or die "can't close $tmpfile: $!"; # do something else to the file here, then... open $tmpfile, "<$tmpfile" or die "can't open $tmpfile for reading: $!"; while (<$tmpfile>) { if ( $something_or_other ) { warn "oops message"; } }

Now the warn will include the actual filename rather than just the filehandle, like TMP.

Now, you probably shouldn't bother go to too much trouble in figuring out a good filename. There is a POSIX function which will do this for you.

CODE 8.6

use POSIX; $tmpfile = POSIX::tmpnam(); open TMP, ">$tmpfile" or die "can't open $tmpfile for writing: $!"; print TMP $str or die "can't write to $tmpfile: $!"; close TMP or die "can't close $tmpfile: $!"; $str = `sort +10n $tmpfile`; if ($?) { die "sort exited with a $?" } unlink $tmpfile or die "can't unlink $tmpfile: $!";

Just one final thing. This seems like just the application for a bi-directional open. You might be tempted to write this using the open2() function.

CODE 9.1

use IPC::Open2; $kidpid = open2('INPUT', 'OUTPUT', 'sort +10n'); print OUTPUT $str or die "can't write to $tmpfile: $!"; close OUTPUT or die "can't close output pipe: $!"; while (<INPUT>) { print "got line $.: $_"; } close INPUT or die "can't close input pipe: $!";

or maybe to read the whole thing back again, undef the RS:

CODE 9.2

use FileHandle; use IPC::Open2; $kidpid = open2('INPUT', 'OUTPUT', 'sort +10n'); print OUTPUT $str or die "can't write to $tmpfile: $!"; close OUTPUT or die "can't close output pipe: $!"; INPUT->input_record_separator(undef); $str = <INPUT>; close INPUT or die "can't close input pipe: $!";

Now, it turns out that in this case it actually works, at least on my system. I wrote all my input do the pipe first, which gladly read it all in up until end of file, and then spit it out. That's how sort work. But cat isn't so friendly, and works far more like most programs.

CODE 9.3

use FileHandle; use IPC::Open2; $kidpid = open2('INPUT', 'OUTPUT', 'cat'); print OUTPUT $str or die "can't write to $tmpfile: $!"; close OUTPUT or die "can't close output pipe: $!"; INPUT->input_record_separator(undef); $str = <INPUT>; close INPUT or die "can't close input pipe: $!";

If you do this and $str is big enough, you'll block forever. And even doing a piece at a time:

CODE 9.4

    use IPC::Open2;
    $kidpid = open2('INPUT', 'OUTPUT', 'cat');
    while (1) {
	print OUTPUT $str;           
	$newstr =<INPUT>;
	print "just read <<< $newstr >>>\n";
    }

This approach won't work because in general, you can't control your child processes notions of buffering, which are entirely useless for simulating interactive work as you're attempting here. The only programs this works on are ones that are designed for it, like dc. There aren't many system programs like it.

I strongly advise against using open2 for almost anything, even though I'm its author. UNIX buffering will just drive you up the wall. You'll end up quite disappointed. It's better to use temp files as we did back in the code examples in section 9 if you really want to deal with an external process in which you wish to control both its input and its output.

There's one final lesson I'd like to impart. It's not a bad idea to order your coding priorities like so:

  1. correctness
  2. maintainability
  3. efficiency

Notice at no point did ``4. cleverness'' enter the picture. Of course, if you're writing Obfuscated C contest entries, or tricksy little JAPHs, then it makes perfect sense to invert these priorities, like so:

  1. cleverness
  2. efficiency
  3. maintainability
  4. correctness

But that's only good for tricks, not solid code, which is why code like that in 5.1, 8.3, or even 7.4 will in the long run make easier both your life and the lives of those around you.

While a LISP or Scheme programmer may be entirely conformatable with arrange their code as Randal did in 0.0, I haven't personal found that most of your average off-the-street UNIX, DOS, VMS, or IBM programmers take very quickly to that style of functional programming -- or even the above average ones, for that matter. It's nice in its own pristine way not to use any temporary variables and all, but it's still not to most people's liking.

And no, I'm not ragging on Randal, merely teasing a bit. He's just trying to be clever, and that's what he does. I'm just submitting a sample chapter for his perusal for inclusion the mythical Alpaca Book :-)

--tom


AUTHOR

Tom Christiansen <tchrist@mox.perl.com>
Original Date: 10 Apr 1995 12:30:06 GMT
Last Revision: 05 Oct 1995 06:56:06 MDT


Return to:
Copyright 1996 Tom Christiansen.
All rights reserved.