Scott Hanselman

The Weekly Source Code 9 - WideFinder Edition

October 23, '07 Comments [7] Posted in ASP.NET | Microsoft | Source Code

In my new ongoing quest to read source code to be a better developer, I now present the ninth in an infinite number of a weekly series called "The Weekly Source Code." Here's some source I'm reading this week that I enjoyed.

Ya, I know this one is just 4 days after the last one, but I was having too much fun and couldn't wait until Wednesday. Plus, it's a new week so poop on you.

Last month Tim Bray shared his experiences writing a program that does, well, here's Joe Cheng's succinct description. Tim calls the project WideFinder.

The Wide Finder challenge is to write a program that:

  1. Scans logfiles for hits on blog articles
  2. which are counted and
  3. sorted with the
  4. top 10 most popular being printed to stdout. It should also
  5. be about as elegant and concise as Tim’s Ruby version and
  6. its performance should scale with additional CPU cores.

And this is done on a fairly large log file of about 250 megs. While Item #6 is the most interesting, many folks are focusing on Item #5. Either way, it's a heck of a lot more interesting problem than FizzBuzz and worth adding to your interview arsenal pocket.

I encourage you to go check out Tim's site as he's continued to list the sources that he finds most interesting. As a primarily C# programmer who's always trying to stretch out of my comfort zone, here's what I've found interesting, in the order I found them interesting.

  • Don Box's Naive Implementation in C# 3.0 - Apparently this is the kind of code Don can write after two beers. Notice the use of yield to make this "LINQ over a text file of CR/LF strings." That's one of those write-it-constantly-over-and-over-again helper methods that makes me wonder why it wasn't just included.
  •     static void Main(string[] args)
        {
            var regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)");
     
            var grouped = from line in ReadLinesFromFile(@"C:\temp\bray.txt")
                          let match = regex.Match(line)
                          where match.Success
                          let url = match.Value
                          group url by url;
      
            var ordered = from g in grouped
                          let count = g.Count()
                          orderby count descending
                          select new { Count = count, Key = g.Key };
            
            foreach (var item in ordered.Take(10))
                Console.WriteLine("{0}: {1}", item.Count, item.Key);
        }
     
        // LINQ-compatible streaming I/O helper
        public static IEnumerable<string> ReadLinesFromFile(string filename)
        {
            using (StreamReader reader = new StreamReader(filename)) {
                while (true)
                {
                    string s = reader.ReadLine();
                    if (s == null)
                        break;
                    yield return s;
                }
            }
        }
  • Joe Cheng tightens it up with his LINQ skillz and does the group and sort all in one swell foop. As an aside, I'd like to see his QuickTimer class for next week. Nice use of my favorite C# idiom - IDisposable/using. Joe also alludes to some parallelism that could be easily added with PLINQ. Maybe we'll see that code soon.
  • using (new QuickTimer("Total time"))
    {
        IEnumerable<string> data = new LineReader(args[0]);
    
    Regex regex = new Regex(@"GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/([^ ]+) ",
            RegexOptions.Compiled | RegexOptions.CultureInvariant);
    
        var result = from line in data
                     let match = regex.Match(line)
                     where match.Success
                     group match by match.Groups[1].Value into grp
                     orderby grp.Count() descending
                     select new { Article = grp.Key, Count = grp.Count() };
    
        foreach (var v in result.Take(10))
            Console.WriteLine("{0}: {1}", v.Article, v.Count);
    }
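Joe doesn't show the QuickTimer class itself, but the shape is easy to guess from how it's used: start a stopwatch in the constructor, print the elapsed time on dispose. Here's a sketch of what it probably looks like, translated into Java's try-with-resources (the closest analogue of C#'s IDisposable/using); the QuickTimer name and label parameter come from Joe's snippet, the rest is my guess.

```java
// Sketch of a QuickTimer-style scoped timer using AutoCloseable,
// the Java analogue of C#'s IDisposable/using idiom.
public class QuickTimer implements AutoCloseable {
    private final String label;
    private final long start = System.nanoTime();

    public QuickTimer(String label) {
        this.label = label;
    }

    @Override
    public void close() {
        // Runs automatically at the end of the try-with-resources block,
        // just like Dispose() at the end of a C# using block.
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(label + ": " + elapsedMs + "ms");
    }

    public static void main(String[] args) throws Exception {
        try (QuickTimer t = new QuickTimer("Total time")) {
            Thread.sleep(50); // stand-in for the real work being timed
        }
    }
}
```

The nice property, in either language, is that the timing code can't be forgotten on an early return or exception: the scope exit always fires the close/dispose.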
  • My programmer's man-crushes continue as Jomo Fisher posts the WideFinder Naive F# Implementation. Notice how everyone uses "naive" to basically say "I'm sure it could be better, so don't be mean." I can't tell you with a straight face that I totally understand this. It's kind of magical.
  • #light
    open System.Text.RegularExpressions
    open System.IO
    open System.Text
     
    let regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)", RegexOptions.Compiled)
    
    let seqRead fileName =
        seq { use reader = new StreamReader(File.OpenRead(fileName))
              while not reader.EndOfStream do
                  yield reader.ReadLine() }
    
    let query fileName = 
        seqRead fileName
        |> Seq.map (fun line -> regex.Match(line)) 
        |> Seq.filter (fun regMatch -> regMatch.Success)
        |> Seq.map (fun regMatch -> regMatch.Value)
        |> Seq.countBy (fun url -> url)
    
    And here's the code to call it:
    
    for result in query @"file.txt" do 
        let url, count = result
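If the F# pipes look magical, it may help to see that |> is just "feed the left side into the function on the right" - the same left-to-right chaining LINQ and Java streams give you with method calls. As a sketch, here's roughly the same pipeline in Java streams, with a few hard-coded sample lines standing in for the log file (Seq.map ≈ map, Seq.filter ≈ filter, Seq.countBy ≈ groupingBy + counting):

```java
import java.util.*;
import java.util.regex.*;
import java.util.stream.*;

public class WideFinderStreams {
    public static void main(String[] args) {
        Pattern p = Pattern.compile(
            "GET /ongoing/When/\\d\\d\\dx/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+)");

        // Hypothetical sample lines standing in for the 250 MB log file.
        List<String> lines = List.of(
            "GET /ongoing/When/200x/2007/10/23/foo HTTP/1.1",
            "GET /ongoing/When/200x/2007/10/23/foo HTTP/1.1",
            "GET /ongoing/When/200x/2007/10/22/bar HTTP/1.1");

        Map<String, Long> counts = lines.stream()
            .map(p::matcher)               // Seq.map: run the regex on each line
            .filter(Matcher::find)         // Seq.filter: keep only the hits
            .map(m -> m.group(1))          // Seq.map: extract the article URL
            .collect(Collectors.groupingBy(u -> u, Collectors.counting())); // Seq.countBy

        counts.entrySet().stream()
            .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
            .limit(10)                     // top 10, most popular first
            .forEach(e -> System.out.println(e.getValue() + ": " + e.getKey()));
    }
}
```

Each stage consumes the previous stage's output, exactly like each |> in Jomo's version; the F# just spells the plumbing with an operator instead of a dot.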
  • Busting out of the Microsoft Languages for a minute, here's Tim's Ruby example:
  • counts = {}
    counts.default = 0
    
    ARGF.each_line do |line|
      if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
        counts[$1] += 1
      end
    end
    
    keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
    keys_by_count[0 .. 9].each do |key|
      puts "#{counts[key]}: #{key}"
    end
  • Here's the Java version from UnintentionalObjectRetention. I haven't done Java since I worked at Nike in 1997 as a contractor:
  • public class WideFinder {     
      public static void main(String[] args) throws IOException {      
        Map<String, Integer> counts = new HashMap<String, Integer>();      
        Pattern p = Pattern.compile("GET /ongoing/When/\\d\\d\\dx/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+) ");  
        BufferedReader in = new BufferedReader(new InputStreamReader(           
             new FileInputStream(args[0]), "US-ASCII"));         
        String line = null;     
        while ((line = in.readLine()) != null) {       
           Matcher m = p.matcher(line);       
           if (m.find()) {            
             String key = m.group();     
             Integer currentCount = counts.get(key);   
             counts.put(key, (currentCount == null ? 1 : (currentCount + 1)));     
           }
        }    
        in.close();      
        List<Entry<String, Integer>> results = new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(results, new Comparator<Entry<String, Integer>>() {
           public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2)        
           {            
              return o2.getValue().compareTo(o1.getValue());     
           }     
         });   
         for(int i = 0; i < 10; i++) {    
           System.out.println(results.get(i));    
         }
      }
    }
  • And last, but kind of first, here it is in LISP from Nate.
  • (defun run (&rest logs)
      (let ((counts (make-hash-table :test #'equal)))
        (dolist (filename logs)
          (with-open-file (stream filename :direction :input
                                  :external-format :latin-1)
            (loop for line = (read-line stream nil stream)
               until (eq line stream)
               do (cl-ppcre:register-groups-bind (match)
                      ("GET /ongoing/When/\\d{3}x/(\\d{4}/\\d{2}/\\d{2}/[^ .]+) " line)
                    (incf (gethash match counts 0))))))
        (loop for key being the hash-keys of counts
           collect key into keys
           finally (map nil #'(lambda (x)
                                (format t "~D: ~A~%" (gethash x counts) x))
                        (subseq (sort keys #'>
                                      :key #'(lambda (x) (gethash x counts))) 0 10)))))

A good way to understand other languages (programming or human) is to read the same story in each of these languages and compare them. Tim's problem serves that purpose well!

Oh, and if you want to see why we program in Managed Code, check out the C version.

Feel free to send me links to cool source that you find hasn't been given a good read.

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

Tuesday, October 23, 2007 7:17:55 AM UTC
I'm not sure that I'm a fan of LINQ for things like this. I don't see how the LINQ versions get you anything that good ole fashioned C# code doesn't. It just kinda looks like a couple of blobs of text to me that hide the logical structure of the code. You could write the old fashioned version in about the same number of lines and it would be readable to the millions of existing C# programmers (plus semi-readable to Java and C++ programmers) and not just those of us who keep up with the latest and greatest new features. The Java version best matches the spirit of how I'd approach this but it would look a lot cleaner written in C# -- particularly with lambda expressions and var.

PLINQ is somewhat compelling but the algorithm here is just reading from a file sequentially to count the lines and then doing a sort. I'm not sure that either of these phases could be efficiently parallelized.

Are there any test log files floating around for this? I took a look at some of the pages you linked to but couldn't find any.
Dave
Tuesday, October 23, 2007 7:24:28 AM UTC
*looks at the C code* Just say no to C ... !

Speaking of yield, I'd be interested to find anyone using it to seriously implement continuations in a client application. Or really, for anything more complicated than tic tac toe.

-Rick (currently dodging plumes of smoke at SoCal TechFest)
Tuesday, October 23, 2007 11:58:21 AM UTC
I don't think you could've missed this but Steve Vinoski has gone absolutely nuts beating this problem to a pulp with Erlang, squeezing out every last bit of performance. See: http://steve.vinoski.net/blog
Dilip
Wednesday, October 24, 2007 12:42:41 AM UTC
Dilip - Ya, Steve is obsessed! :)
Wednesday, October 24, 2007 4:09:34 PM UTC
To be correct, the F# example needs two more lines: add "|> Seq.take 10" after "|> Seq.countBy (fun url -> url)" to limit output to the top 10 entries; then, add "printfn "%A - %A" url count" after "let url, count = result" to output the results to stdout.
James
Saturday, October 27, 2007 6:08:51 AM UTC
Scott, don't forget about PowerShell! :-) I'm not sure how it does on #6 (perf) cuz I don't have any large IIS log files laying around but it rocks on #5 (elegant and concise):

PS> $pattern = 'GET /ongoing/When/\d{3}x/(?<url>\d{4}/\d{2}/\d{2}/[^ .]+)'
PS> Get-Content *.log | Foreach {if ($_ -match $pattern){$matches.url}} | Group | Sort Count -desc | Select -first 10
Thursday, November 08, 2007 11:43:25 PM UTC
Scott... I love reading your Weekly Source Code articles. Unfortunately they come out seriously garbled when viewed in Google Reader. All the source code shows up unformatted on a single line (no carriage returns or nothing). I blogged about this as a pet peeve of mine... blogs that get seriously destroyed by feed aggregators. Some technical blogs get this right... some don't. If you want to see what I'm talking about with regards to this article, see my post today which includes screenshots.

I'm not claiming my blog is a paragon of visual goodness... it most definitely isn't. But it is at least readable in Google Reader.

Thanks for all the good work you do!

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.