Word Counts Example in Ruby and Scala

December 8th, 2009 by Zach Cox Leave a reply »

A while ago I was asked, as a pre-interview task for another company, to write some code in any language that counted word frequencies in these newsgroup articles. I recently came across the Ruby and Scala scripts I wrote and thought it would be fun to post them.

First, here is the wordcounts.rb script. It assumes the newsgroup files have already been downloaded and is executed by running ruby wordcounts.rb.

#Change rootDir to the location of your newsgroup files
rootDir = "/home/zcox/dev/20_newsgroups"
raise rootDir + " does not exist" unless File.directory? rootDir
 
#Iterates over all files under rootDir, opens each one and passes it to the block
def files(rootDir)
  Dir.foreach(rootDir) do |dir|
    if dir != "." && dir != ".."
      puts "Processing " + dir
      Dir.foreach(rootDir + "/" + dir) do |file|
        if file != "." && file != ".."
          open(rootDir + "/" + dir + "/" + file) do |f|
            yield(f)
          end
        end
      end
    end
  end
end
 
t1 = Time.now
counts = Hash.new(0) #0 will be the default value for non-existent keys
files(rootDir) do |file|
  file.read.scan(/\w+/) { |word| counts[word.downcase] += 1 }
end
 
puts "Writing counts in decreasing order"
open("counts-descreasing-ruby", "w") do |out|
  counts.sort { |a, b| b[1] <=> a[1] }.each { |pair| out << "#{pair[0]}\t#{pair[1]}\n" }
end
 
puts "Writing counts in alphabetical order"
open("counts-alphabetical-ruby", "w") do |out|
  counts.sort { |a, b| a[0] <=> b[0] }.each { |pair| out << "#{pair[0]}\t#{pair[1]}\n" }
end
 
t2 = Time.now
puts "Finished in " + (t2 - t1).to_s + " seconds"

The code just iterates over each file, splits each file into words and aggregates the word counts using a Hash object. It then sorts the Hash in two different ways and writes each to a separate file. I think Ruby makes this code very easy to read and Ruby’s core library support makes this task pretty simple to accomplish.

Next, here is the wordcounts.scala version. Again it assumes the newsgroup articles are already downloaded and can be run by executing scala wordcounts.scala.

//Change rootDir to the location of your newsgroup files
import java.io._
val rootDir = new File("/home/zcox/dev/20_newsgroups")
if (!rootDir.exists) throw new IllegalArgumentException(rootDir + " does not exist")
 
/** Iterates over all files under rootDir, opens each one and passes it to the function */
def files(rootDir: File)(process: File => Unit) {
  for (dir <- rootDir.listFiles; if dir.isDirectory) {
    println("Processing" + dir)
    for (file <- dir.listFiles; if file.isFile) {
      process(file)
    }
  }
}
 
val t1 = System.currentTimeMillis
var counts = Map.empty[String, Int].withDefaultValue(0)
files(rootDir) { file => 
  file.split("""\W+""").foreach { word => counts = counts(word.toLowerCase) += 1 }
}
 
println("Writing counts in decreasing order")
write(counts, "counts-descreasing-scala") {_._2 > _._2}
 
println("Writing counts in alphabetical order")
write(counts, "counts-alphabetical-scala") {_._1 < _._1}
 
val t2 = System.currentTimeMillis
println("Finished in " + ((t2 - t1)/1000.0) + " seconds");
 
/** Writes the specified map to the specified file in tab-delimited format, sorted accordingly. */
def write[K, V](map: Map[K, V], file: String)(sort: (Tuple2[K, V], Tuple2[K, V]) => Boolean) {
  using (new PrintWriter(new FileWriter(file))) { out => 
    map.toList.sort(sort).foreach { pair => out.println(pair._1 + "\t" + pair._2) }
  }
}
 
/** Converts a File to a String. */
implicit def file2String(file: File): String = {
  val builder = new StringBuilder
  using (new BufferedReader(new FileReader(file))) { reader => 
    var line = reader.readLine
    while (line != null) {
      builder.append(line).append('\n')
      line = reader.readLine
    }
  }
  builder.toString
}
 
/** Performs some operation on the specified closeable object and then ensures it gets closed. */
def using[Closeable <: {def close(): Unit}, B](closeable: Closeable)(getB: Closeable => B): B = 
  try {
    getB(closeable)
  } finally {
    closeable.close()
  }

Notice that Scala code is statically typed and compiled software that can be run as a simple script – Scala really does scale from small one-off scripts to large enterprise systems. It mirrors the Ruby script fairly closely, with the addition of some helper functions at the end to make up for some features that the core Scala library doesn’t provide. Overall I think it’s just as readable as the Ruby code, but comes with static type-checking and it runs on the JVM.

I have included some timing info as well. On my 2GHz Core2 Duo with 3GB RAM, the Ruby script averages about 60 seconds while the Scala script averages about 30 seconds. The Scala version uses immutable Map objects to store the word counts; thus a new Map object is created for each word. I switched it to a mutable Map (which required only minor code changes) and it dropped to about 15 seconds. Hardly scientific, I’m sure a lot of optimizations could be done on both scripts, but this shows the performance gains provided that can be achieved on the JVM using Scala.

If anyone can spot any improvements to make to these scripts, please share them with comments!

Advertisement

16 comments

  1. Hey,

    Thats a fun exercise and it was great seeing Ruby and Scala so directly contrasted. I’ve done a Clojure version as well:

    http://www.bestinclass.dk/index.php/2009/12/clojure-vs-ruby-scala-transient-newsgroups/

    Keep up the blogging,
    Lau

  2. Shot says:

    I’d seriously argue the following is more rubyish – with the idiomatic.rb version being the more, well, idiomatic one, while the efficient.rb version being more efficient by not having to flatten the per-file word arrays: http://gist.github.com/257164

  3. Jon Harrop says:

    I have written an F# solution that both short and efficient at the same time:

    http://fsharpnews.blogspot.com/2009/12/zach-cox-word-count-challenge.html

  4. Mike McNally says:

    If I got to use the “language” of my choice, I think I’d type in a quick shell pipeline to find the regular files, use xargs to run sed and strip out the words, tr to normalize alpha case, awk to count the occurrences, and sort to order the results.

  5. Hossam Karim says:

    I believe this is typical MapReduce problem

  6. karakfa says:

    This is a favorite question of mine as well, but I expect the developer to be pragmatic and provide me a scripting solution that can be written and run in less than 5 minutes.

    // cat individual files into all.txt

    and do the rest in awk one liner

    awk ‘{for(i=1;i freqsorted.txt

  7. karakfa says:

    My code got eaten in the previous comment. I posted on my blog. Here it is: http://karakfa.blogspot.com/2009/12/counting-word-frequencies.html

  8. John says:

    I implemented both a Python version and a Factor version, for comparison.

    http://re-factor.blogspot.com/2009/12/counting-word-frequencies.html

  9. Zach Cox says:

    Thank you to everyone for your comments and follow-up posts! When writing this post I hoped it would spark creative discussion and am glad that’s what happened.

    Some additional background on the original problem: the task was stated basically as “in your job as a developer, you’re given a one-off task one day where you need to produce a report of the word frequencies in these newsgroup files, write some code in whatever language you want.” So my intentions with this post were never to say “language X is better than language Y” and my intention for including timing data was never to prove “language X is faster than language Y”; I just wanted to post my particular solutions to the problem and hopefully encourage others to provide constructive criticism on my solutions and provide solutions of their own in other languages. Studying other developers’ code is a great way to learn, and that’s why I made this post.

    I learned a lot from viewing the code submitted in comments to this post as well as to the other follow-up posts, and revised my original code:
    Ruby: http://gist.github.com/268579
    Scala: http://gist.github.com/268580

    I decided to remove the timing data because I don’t want this to become a performance test. I also removed the “status update” lines, since we now know these scripts execute fast enough to make that code unnecessary.

    The Ruby version now uses Dir[] as it should have before instead of the ridiculous directory-traversal method; I definitely rushed too fast on this part before and should have looked for something like Dir[] in Ruby core.

    In the Scala version I condensed the use of the files method into one line, used underscore to represent the file provided to the first block, but couldn’t seem to get underscore to work for word in 2nd block, not sure why.

    Regarding the Scala LoC count: only the first 15 lines are problem-specific, the remainder are 3 methods that probably should be in the core Scala library, and probably are in various 3rd party libs. I think those 3 methods show a great power of Scala though, to extend the language so the main problem-specific code is more readable:
    – files method just lets you say “do this to each file under this directory”: files(dir) { file => //use file }
    – file2String lets you use a File as a String: file.split(regex) <== split is a method of class String, not File
    – using is a new control structure that safely closes any object with a close() method: using(dbConnection) {…}, using(reader) {…}, using(writer) {…}, etc.

    Again, thanks to everyone for the comments & follow-ups, and let's continue to learn from one another!

  10. Virasak says:

    Hi, I am a Scala beginner and I try to clean your Scala code to be more readable. Here is a result
    http://gist.github.com/265899#file_word_freq.scala

  11. Brianary says:

    And in Perl (functional style):

    perl -lne ‘map{$c{$_}++}(/(\w+)/g);END{map{print}(sort keys%c)}’ **/*.txt

  12. templates says:

    I simply couldn’t go away your web site prior to suggesting that I really loved the usual info an individual supply to your guests? Is going to be back incessantly to check up on new posts

Leave a Reply

*