Solve The Real Problem: October 2007

I've started reading Beautiful Code and the first three chapters are just what I expected: veterans talking about some specific problems, more to give you insight into how they think than to help you fully understand the problem. They've been good.

Now, however, I've read the fourth chapter, by Tim Bray, and I'm embarrassed for him. The article reads as biased Ruby evangelism with religious reasons for why Ruby is great, in stark contrast to the previous chapters that focused on making good design and implementation choices. I don't dislike or know Tim, but his chapter in this book actually bothered me enough to write about it.

Tim's style of prose is a little too informal and the chapter feels like it is targeted at beginner programmers. However, it's the content that really bothered me. For example, he states the following:

This example (and most of the other examples in this chapter) is in the Ruby programming language because I believe it to be, while far from perfect, the most readable of languages.
If you don't know Ruby, learning it will probably make you a better programmer. In Chapter 29, the creator of Ruby, Yukihiro Matsumoto (generally known as "Matz"), discusses some of the design choices that have attracted me and so many other programmers to the language.

He then follows that with his first example program, which he elaborates on later to explore the problem domain he has chosen: showing the ten most popular articles on his personal blog (which is of course referred to with its URL). The program he shows is:

1 ARGF.each_line do |line|
2   if line =~ %r{GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+ }
3     puts line
4   end
5 end

Tim puts himself in religious territory right away by declaring his belief that Ruby is not just a beautiful language or among beautiful languages, but is actually the most beautiful language. If this were an isolated statement, it would just be a poor choice of words, but much of the article is rife with Ruby praise. And, where Ruby is lacking (such as the need for the Pascal-like verbose block terminator "end"), he acknowledges this, but in a dismissive way that makes you feel like he's waving his hands in front of the warts. I found this especially distracting since he had to hand-wave away two of the five lines of his first program example.

The start of the second paragraph in his program's preamble almost offended me. It seems very presumptuous for Tim to declare that I'm not as good a programmer as I could (should?) be because I don't know Ruby. The sentiment is compounded by the next sentence which smacks of "all the other kids are doing it". I'm prepared to be surprised and disturbed by facts and even anecdotes in a book like this, but not by judgements and peer pressure.

So, after presenting the reader with this five-line program in "the most readable of languages", Tim then takes an entire page to describe what it does, line by line. Clearly, he needs to explain some of the unique, custom syntax (both of those adjectives tend to fall outside of my personal beauty bucket, by the way) used by the program. I'm not sure why a line-by-line explanation is needed if the language and program are so readable and beautiful. This isn't something any of the previous chapters' authors have felt the need to do, and one was performing an incremental complexity analysis of Quicksort. Ironically, the line numbers in the example are apparently added by Tim to aid readability.

Much of the praise in this early section of the chapter is devoted to regular expressions, and that is justified. There seems to be an implication that that praise is somehow attributable to Ruby, but this is probably more the fault of the mood set by Tim than it is that of the actual relevant text.

Notice also that this program is equivalent to:

egrep "GET /ongoing/When/[0-9][0-9][0-9]x/[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]/[^ .]+ "

Given the application domain of reporting on log files, grep seems a more suitable solution so far. Tim expands the example after an overly detailed and unnecessarily exotic explanation of associative data structures (which he refers to as "Content-Addressable Storage") to instead count each article reference. I think it's important to see his expanded example for context.

counts = {}
counts.default = 0

ARGF.each_line do |line|
  if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
    counts[$1] += 1
  end
end

keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
keys_by_count[0 .. 9].each do |key|
  puts "#{counts[key]}: #{key}"
end

From what I can tell, it becomes equivalent to:

egrep -o "/ongoing/When/[0-9][0-9][0-9]x/([0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]/[^ .]+) " \
  | sort | uniq -c | sort -r -n | head -10

Approaches similar to the command line above seem acceptable to Tim, because later, when discussing more complex problems, he talks about using multiple programs that produce intermediate files (although not in a pipeline) and doing the processing as a series of separate Ruby programs.
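To make the stages of that pipeline concrete, here is a small sketch run against a handful of made-up log lines. The sample_log and top_articles names and the log entries themselves are invented for illustration; they are not from Tim's chapter or from my real logs. The pattern uses [0-9] rather than \d because POSIX extended regular expressions don't define \d, so not every grep accepts it.

```shell
#!/bin/sh
# Hypothetical sample data: a few invented access-log lines.
sample_log() {
  cat <<'EOF'
10.0.0.1 - - [17/Oct/2007:10:00:00 -0700] "GET /ongoing/When/200x/2007/10/01/First-Post HTTP/1.1" 200 1234
10.0.0.2 - - [17/Oct/2007:10:00:01 -0700] "GET /ongoing/When/200x/2007/10/01/First-Post HTTP/1.1" 200 1234
10.0.0.3 - - [17/Oct/2007:10:00:02 -0700] "GET /ongoing/When/200x/2007/10/05/Second-Post HTTP/1.1" 200 2345
10.0.0.4 - - [17/Oct/2007:10:00:03 -0700] "GET /ongoing/When/200x/2007/10/01/First-Post HTTP/1.1" 200 1234
10.0.0.5 - - [17/Oct/2007:10:00:04 -0700] "GET /ongoing/When/200x/2007/10/05/Second-Post HTTP/1.1" 200 2345
10.0.0.6 - - [17/Oct/2007:10:00:05 -0700] "GET /ongoing/When/200x/2007/10/09/Third-Post HTTP/1.1" 200 3456
10.0.0.7 - - [17/Oct/2007:10:00:06 -0700] "GET /ongoing/When/200x/2007/10/01/First-Post.png HTTP/1.1" 200 999
10.0.0.8 - - [17/Oct/2007:10:00:07 -0700] "GET /ongoing/ongoing.atom HTTP/1.1" 200 5000
EOF
}

# Hypothetical helper wrapping the pipeline; reads a log on stdin.
top_articles() {
  # 1. keep only the matching article path from each line (-o)
  # 2. sort so that identical paths become adjacent
  # 3. collapse adjacent duplicates, prefixing each with its count
  # 4. order by that count, largest first
  # 5. keep the top ten
  grep -E -o '/ongoing/When/[0-9][0-9][0-9]x/[0-9][0-9][0-9][0-9]/[0-9][0-9]/[0-9][0-9]/[^ .]+ ' \
    | sort | uniq -c | sort -r -n | head -10
}

sample_log | top_articles
```

With these sample lines, First-Post should come out on top with a count of 3, followed by Second-Post and Third-Post. The .png request is dropped because [^ .]+ must be followed directly by a space, and the feed request never matches the /ongoing/When/ prefix.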

Doing the same took another six lines of code in Ruby, and all of the useful syntax appears to be directly borrowed from Perl. In uncompressed first-draft Perl, you can write that as:

while(<>) { 
  ++$counts{$1} if m!GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) !; 
} 
@sorted = sort { $b->[1] <=> $a->[1] } map { [$_, $counts{$_}] } keys %counts; 
for $i (0..9) { 
  print "$sorted[$i]->[0] $sorted[$i]->[1]\n"; 
}

For my logs on my system (with a different regex, of course), the command line runs in ~2.4 seconds and the Perl runs in ~0.76 seconds. Tim's Ruby version processes my logs in ~2.0 seconds. My Perl took me longer to write than the command line, of course, but it didn't take more than a couple of minutes.

Tim says he wondered if his Ruby program was slow, so he wrote a Perl version to compare it to. But he doesn't include his Perl version in the article, so he doesn't draw any comparisons between the two implementations. Given his assertions about how much more readable Ruby is, I would like to have seen Tim do a direct comparison, especially since an alternative implementation was prepared for performance comparisons anyway. Its omission leaves his claims looking unfounded and tenuous.

The chapter does raise good points about when to spend time optimizing, and there are a few paragraphs that explicitly credit the "scanning lines of text with regexes" approach to awk. They are in a different typeface and I wondered if they'd been inserted by the editors, but they weren't; they're written by Tim himself as a sidebar. The discussion of binary search and its Java implementation is textbook material with a small amount of advice, but nothing you wouldn't guess or know already. The chapter ends with an out-of-place discussion about the large Internet search engines that has nothing to do with design or code that I could see.

The point is that there is no great improvement in programmer efficiency, performance or readability demonstrated here with the use of Ruby, so why does Tim focus on the language instead of the application domain and how its problems can be solved elegantly? Why did he choose such a simplistic problem that any experienced programmer can solve with a one-liner UNIX command or a few lines of a common log-processing language (Perl)? I don't like being fed an evangelist message that comes without respect or substance, but that's what much of this chapter feels like. Given Tim Bray's experience and expertise in information systems, I was looking forward to something really interesting and worth exploring.

Wednesday, October 17, 2007

Embarrassing Code
