Scott Hanselman

The Weekly Source Code 9 - WideFinder Edition

October 23, '07 Comments [7] Posted in ASP.NET | Microsoft | Source Code

In my new ongoing quest to read source code to be a better developer, I now present the ninth in an infinite number of a weekly series called "The Weekly Source Code." Here's some source I'm reading this week that I enjoyed.

Ya, I know this one is just 4 days after the last one, but I was having too much fun and couldn't wait until Wednesday. Plus, it's a new week so poop on you.

Last month Tim Bray shared his experiences writing a program that does, well, here's Joe Cheng's succinct description. Tim calls the project WideFinder.

The Wide Finder challenge is to write a program that:

  1. Scans logfiles for hits on blog articles
  2. which are counted and
  3. sorted with the
  4. top 10 most popular being printed to stdout. It should also
  5. be about as elegant and concise as Tim’s Ruby version and
  6. its performance should scale with additional CPU cores.

And this is done on a fairly large log file of about 250 megs. While Item #6 is the most interesting, many folks are focusing on Item #5. Either way, it's a heck of a lot more interesting problem than FizzBuzz and worth adding to your interview arsenal pocket.

I encourage you to go check out Tim's site as he's continued to list the sources that he finds most interesting. As a primarily C# programmer who's always trying to stretch out of my comfort zone, here's what I've found interesting, in the order I found them interesting.

  • Don Box's Naive Implementation in C# 3.0 - Apparently this is the kind of code Don can write after two beers. Notice the use of yield to make this "LINQ over a text file of CR/LF strings." That's one of those write-it-constantly-over-and-over-again helper methods that makes me wonder why it wasn't just included.
  •     static void Main(string[] args)
        {
            var regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)");
     
            var grouped = from line in ReadLinesFromFile(@"C:\temp\bray.txt")
                          let match = regex.Match(line)
                          where match.Success
                          let url = match.Value
                          group url by url;
      
            var ordered = from g in grouped
                          let count = g.Count()
                          orderby count descending
                          select new { Count = count, Key = g.Key };
            
            foreach (var item in ordered.Take(10))
                Console.WriteLine("{0}: {1}", item.Count, item.Key);
        }
     
        // LINQ-compatible streaming I/O helper
        public static IEnumerable<string> ReadLinesFromFile(string filename)
        {
            using (StreamReader reader = new StreamReader(filename)) {
                while (true)
                {
                    string s = reader.ReadLine();
                    if (s == null)
                        break;
                    yield return s;
                }
            }
        }
  • Joe Cheng tightens it up with his LINQ skillz and does the group and sort all in one swell foop. As an aside, I'd like to see his QuickTimer class for next week. Nice use of my favorite C# idiom - IDisposable/using. Joe also alludes to some parallelism that could be easily added with PLINQ. Maybe we'll see that code soon.
  • using (new QuickTimer("Total time"))
    {
        IEnumerable<string> data = new LineReader(args[0]);
    
    Regex regex = new Regex(@"GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/([^ ]+) ",
            RegexOptions.Compiled | RegexOptions.CultureInvariant);
    
        var result = from line in data
                     let match = regex.Match(line)
                     where match.Success
                     group match by match.Groups[1].Value into grp
                     orderby grp.Count() descending
                     select new { Article = grp.Key, Count = grp.Count() };
    
        foreach (var v in result.Take(10))
            Console.WriteLine("{0}: {1}", v.Article, v.Count);
    }
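Joe doesn't show the QuickTimer class itself, but the shape is easy to guess from how it's used: start a stopwatch in the constructor, print the elapsed time on dispose. Here's a sketch of what it probably looks like, translated into Java's try-with-resources (the closest analogue of C#'s IDisposable/using); the QuickTimer name and label parameter come from Joe's snippet, the rest is my guess.

```java
// Sketch of a QuickTimer-style scoped timer using AutoCloseable,
// the Java analogue of C#'s IDisposable/using idiom.
public class QuickTimer implements AutoCloseable {
    private final String label;
    private final long start = System.nanoTime();

    public QuickTimer(String label) {
        this.label = label;
    }

    @Override
    public void close() {
        // Runs automatically at the end of the try-with-resources block,
        // just like Dispose() at the end of a C# using block.
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + ": " + elapsedMs + "ms");
    }

    public static void main(String[] args) throws Exception {
        try (QuickTimer t = new QuickTimer("Total time")) {
            Thread.sleep(50); // stand-in for the real work being timed
        }
    }
}
```

The nice property, in either language, is that the timing code can't be forgotten on an early return or exception: the scope exit always fires the close/dispose.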
  • My programmer's man-crushes continue as Jomo Fisher posts the WideFinder Naive F# Implementation. Notice how everyone uses "naive" to basically say "I'm sure it could be better, so don't be mean." I can't tell you with a straight face that I totally understand this. It's kind of magical.
  • #light
    open System.Text.RegularExpressions
    open System.IO
    open System.Text
     
    let regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)", RegexOptions.Compiled)
    
    let seqRead fileName =
        seq { use reader = new StreamReader(File.OpenRead(fileName))
              while not reader.EndOfStream do
                  yield reader.ReadLine() }
    
    let query fileName = 
        seqRead fileName
        |> Seq.map (fun line -> regex.Match(line)) 
        |> Seq.filter (fun regMatch -> regMatch.Success)
        |> Seq.map (fun regMatch -> regMatch.Value)
        |> Seq.countBy (fun url -> url)
    
    And here's the code to call it:
    
    for result in query @"file.txt" do 
        let url, count = result
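If the F# pipes look magical, it may help to see that |> is just "feed the left side into the function on the right" - the same left-to-right chaining LINQ and Java streams give you with method calls. As a sketch, here's roughly the same pipeline in Java streams, with a few hard-coded sample lines standing in for the log file (Seq.map ≈ map, Seq.filter ≈ filter, Seq.countBy ≈ groupingBy + counting):

```java
import java.util.*;
import java.util.regex.*;
import java.util.stream.*;

public class WideFinderStreams {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(
            "GET /ongoing/When/\\d\\d\\dx/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+)");

        // Hypothetical sample lines standing in for the 250 MB log file.
        List<String> lines = List.of(
            "GET /ongoing/When/200x/2007/10/23/foo HTTP/1.1",
            "GET /ongoing/When/200x/2007/10/23/foo HTTP/1.1",
            "GET /ongoing/When/200x/2007/10/22/bar HTTP/1.1");

        Map<String, Long> counts = lines.stream()
            .map(p::matcher)               // Seq.map: run the regex on each line
            .filter(Matcher::find)         // Seq.filter: keep only the hits
            .map(m -> m.group(1))          // Seq.map: extract the article URL
            .collect(Collectors.groupingBy(u -> u, Collectors.counting())); // Seq.countBy

        counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(10)                     // top 10, most popular first
            .forEach(e -> System.out.println(e.getValue() + ": " + e.getKey()));
    }
}
```

Each stage consumes the previous stage's output, exactly like each |> in Jomo's version; the F# just spells the plumbing with an operator instead of a dot.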
  • Busting out of the Microsoft Languages for a minute, here's Tim's Ruby example:
  • counts = {}
    counts.default = 0
    
    ARGF.each_line do |line|
      if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
        counts[$1] += 1
      end
    end
    
    keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
    keys_by_count[0 .. 9].each do |key|
      puts "#{counts[key]}: #{key}"
    end
  • Here's the Java version from UnintentionalObjectRetention. I haven't done Java since I worked at Nike in 1997 as a contractor:
  • public class WideFinder {     
      public static void main(String[] args) throws IOException {      
        Map<String, Integer> counts = new HashMap<String, Integer>();      
        Pattern p = Pattern.compile("GET /ongoing/When/\\d\\d\\dx/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+) ");  
        BufferedReader in = new BufferedReader(new InputStreamReader(           
             new FileInputStream(args[0]), "US-ASCII"));         
        String line = null;     
        while ((line = in.readLine()) != null) {       
           Matcher m = p.matcher(line);       
           if (m.find()) {            
             String key = m.group();     
             Integer currentCount = counts.get(key);   
             counts.put(key, (currentCount == null ? 1 : (currentCount + 1)));     
           }
        }    
        in.close();      
        List<Entry<String, Integer>> results = new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(results, new Comparator<Entry<String, Integer>>() {
           public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2)        
           {            
              return o2.getValue().compareTo(o1.getValue());     
           }     
         });   
         for(int i = 0; i < 10; i++) {    
           System.out.println(results.get(i));    
         }
      }
    }
  • And last, but kind of first, here it is in LISP from Nate.
  • (defun run (&rest logs)
      (let ((counts (make-hash-table :test #'equal)))
        (dolist (filename logs)
          (with-open-file (stream filename :direction :input
                                  :external-format :latin-1)
            (loop for line = (read-line stream nil stream)
               until (eq line stream)
               do (cl-ppcre:register-groups-bind (match)
                      ("GET /ongoing/When/\\d{3}x/(\\d{4}/\\d{2}/\\d{2}/[^ .]+) " line)
                    (incf (gethash match counts 0))))))
        (loop for key being the hash-keys of counts
           collect key into keys
           finally (map nil #'(lambda (x)
                                (format t "~D: ~A~%" (gethash x counts) x))
                        (subseq (sort keys #'>
                                      :key #'(lambda (x) (gethash x counts))) 0 10)))))

A good way to understand other languages (programming or human) is to read the same story in each of these languages and compare them. Tim's problem serves that purpose well!

Oh, and if you want to see why we program in Managed Code, check out the C version.

Feel free to send me links to cool source that you find hasn't been given a good read.

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

Tuesday, October 23, 2007 7:17:55 AM UTC
I'm not sure that I'm a fan of LINQ for things like this. I don't see how the LINQ versions get you anything that good ole fashioned C# code doesn't. It just kinda looks like a couple of blobs of text to me that hide the logical structure of the code. You could write the old fashioned version in about the same number of lines and it would be readable to the millions of existing C# programmers (plus semi-readable to Java and C++ programmers) and not just those of us who keep up with the latest and greatest new features. The Java version best matches the spirit of how I'd approach this but it would look a lot cleaner written in C# -- particularly with lambda expressions and var.

PLINQ is somewhat compelling but the algorithm here is just reading from a file sequentially to count the lines and then doing a sort. I'm not sure that either of these phases could be efficiently parallelized.

Are there any test log files floating around for this? I took a look at some of the pages you linked to but couldn't find any.
Dave
Tuesday, October 23, 2007 7:24:28 AM UTC
*looks at the C code* Just say no to C ... !

Speaking of yield, I'd be interested to find anyone using it to seriously implement continuations in a client application. Or really, for anything more complicated than tic tac toe.

-Rick (currently dodging plumes of smoke at SoCal TechFest)
Tuesday, October 23, 2007 11:58:21 AM UTC
I don't think you could've missed this but Steve Vinoski has gone absolutely nuts beating this problem to a pulp with Erlang, squeezing out every last bit of performance. See: http://steve.vinoski.net/blog
Dilip
Wednesday, October 24, 2007 12:42:41 AM UTC
Dilip - Ya, Steve is obsessed! :)
Wednesday, October 24, 2007 4:09:34 PM UTC
To be correct, the F# example needs two more lines: add "|> Seq.take 10" after "|> Seq.countBy (fun url -> url)" to limit output to the top 10 entries; then, add "printfn "%A - %A" url count" after "let url, count = result" to output the results to stdout.
James
Saturday, October 27, 2007 6:08:51 AM UTC
Scott, don't forget about PowerShell! :-) I'm not sure how it does on #6 (perf) cuz I don't have any large IIS log files laying around but it rocks on #5 (elegant and concise):

PS> $pattern = 'GET /ongoing/When/\d{3}x/(?<url>\d{4}/\d{2}/\d{2}/[^ .]+)'
PS> Get-Content *.log | Foreach {if ($_ -match $pattern){$matches.url}} | Group | Sort Count -desc | Select -first 10
Thursday, November 08, 2007 11:43:25 PM UTC
Scott... I love reading your Weekly Source Code articles. Unfortunately they come out seriously garbled when viewed in Google Reader. All the source code shows up unformatted on a single line (no carriage returns or nothing). I blogged about this as a pet peeve of mine... blogs that get seriously destroyed by feed aggregators. Some technical blogs get this right... some don't. If you want to see what I'm talking about with regards to this article, see my post today which includes screenshots.

I'm not claiming my blog is a paragon of visual goodness... it most definitely isn't. But it is at least readable in Google Reader.

Thanks for all the good work you do!

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.