Scott Hanselman

The Weekly Source Code 9 - WideFinder Edition

October 22, 2007 Comment on this post [7] Posted in ASP.NET | Microsoft | Source Code
Sponsored By

In my new ongoing quest to read source code to be a better developer, I now present the ninth in an infinite number of a weekly series called "The Weekly Source Code." Here's some source I'm reading this week that I enjoyed.

Ya, I know this one is just 4 days after the last one, but I was having too much fun and couldn't wait until Wednesday. Plus, it's a new week so poop on you.

Last month Tim Bray shared his experiences writing a program that does, well, here's Joe Cheng's succinct description. Tim calls the project WideFinder.

The Wide Finder challenge is to write a program that:

  1. Scans logfiles for hits on blog articles
  2. which are counted and
  3. sorted with the
  4. top 10 most popular being printed to stdout. It should also
  5. be about as elegant and concise as Tim’s Ruby version and
  6. its performance should scale with additional CPU cores.

And this is done on a fairly large log file of about 250 megs. While Item #6 is the most interesting, many folks are focusing on Item #5. Either way, it's a heck of a lot more interesting problem than FizzBuzz and worth adding to your interview arsenal pocket.

I encourage you to go check out Tim's site as he's continued to list the sources that he finds most interesting. As a primarily C# programmer who's always trying to stretch out of my comfort zone, here's what I've found interesting, in the order I found them interesting.

  • Don Box's Naive Implementation in C# 3.0 - Apparently this is the kind of code Don can write after two beers. Notice the use of yield to make this "LINQ over a text file of CR/LF strings." That's one of those write-it-constantly-over-and-over-again helper methods that makes me wonder why it wasn't just included.
  •     static void Main(string[] args)
        {
            var regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)");
     
            var grouped = from line in ReadLinesFromFile(@"C:\temp\bray.txt")
                          let match = regex.Match(line)
                          where match.Success
                          let url = match.Value
                          group url by url;
      
            var ordered = from g in grouped
                          let count = g.Count()
                          orderby count descending
                          select new { Count = count, Key = g.Key };
            
            foreach (var item in ordered.Take(10))
                Console.WriteLine("{0}: {1}", item.Count, item.Key);
        }
     
        // LINQ-compatible streaming I/O helper
        public static IEnumerable<string> ReadLinesFromFile(string filename)
        {
            using (StreamReader reader = new StreamReader(filename)) {
                while (true)
                {
                    string s = reader.ReadLine();
                    if (s == null)
                        break;
                    yield return s;
                }
        }
  • Joe Cheng tightens it up with his LINQ skillz and does the group and sort all in one swell foop. As an aside, I'd like to see his QuickTimer class for next week. Nice use of my favorite C# idiom - IDisposable/using. Joe also alludes to some parallelism that could be easily added with PLINQ. Maybe we'll see that code soon.
  • using (new QuickTimer(“Total time”))
    {
        IEnumerable<string> data = new LineReader(args[0]);
    
        Regex regex = new Regex(@”GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/([^ ]+) “,
            RegexOptions.Compiled | RegexOptions.CultureInvariant);
    
        var result = from line in data
                     let match = regex.Match(line)
                     where match.Success
                     group match by match.Groups[1].Value into grp
                     orderby grp.Count() descending
                     select new { Article = grp.Key, Count = grp.Count() };
    
        foreach (var v in result.Take(10))
            Console.WriteLine(“{0}: {1}”, v.Article, v.Count);
    }
  • My programmer's man-crushes continue as Jomo Fisher posts the WideFinder Naive F# Implementation. Notice how everyone uses "naive" to basically say "I'm sure it could be better, so don't be mean." I can't tell you with a straight face that I totally understand this. It's kind of magical.
  • #light
    open System.Text.RegularExpressions
    open System.IO
    open System.Text
     
    let regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)", RegexOptions.Compiled)
    
    let seqRead fileName =
        seq { use reader = new StreamReader(File.OpenRead(fileName))
              while not reader.EndOfStream do
                  yield reader.ReadLine() }
    
    let query fileName = 
        seqRead fileName
        |> Seq.map (fun line -> regex.Match(line)) 
        |> Seq.filter (fun regMatch -> regMatch.Success)
        |> Seq.map (fun regMatch -> regMatch.Value)
        |> Seq.countBy (fun url -> url)
    
    *And here's the code to call it:    
    
    for result in query @"file.txt" do 
        let url, count = result
  • Busting out of the Microsoft Languages for a minute, here's Tim's Ruby example:
  • counts = {}
    counts.default = 0
    
    ARGF.each_line do |line|
      if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
        counts[$1] += 1
      end
    end
    
    keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
    keys_by_count[0 .. 9].each do |key|
      puts "#{counts[key]}: #{key}"
    end
  • Here's the Java version from UnintentionalObjectRetention. I haven't done Java since I worked at Nike in 1997 as a contractor:
  • public class WideFinder {     
      public static void main(String[] args) throws IOException {      
        Map<String, Integer> counts = new HashMap<String, Integer>();      
        Pattern p = Pattern.compile("GET /ongoing/When/\\d\\d\\dx/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+) ");  
        BufferedReader in = new BufferedReader(new InputStreamReader(           
             new FileInputStream(args[0]), "US-ASCII"));         
        String line = null;     
        while ((line = in.readLine()) != null) {       
           Matcher m = p.matcher(line);       
           if (m.find()) {            
             String key = m.group();     
             Integer currentCount = counts.get(key);   
             counts.put(key, (currentCount == null ? 1 : (currentCount + 1)));     
           }
        }    
        in.close();      
        List<Entry<String, Integer>> results = new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());      Collections.sort(results, new Comparator<Entry<String, Integer>>() {   
           public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2)        
           {            
              return o2.getValue().compareTo(o1.getValue());     
           }     
         });   
         for(int i = 0; i < 10; i++) {    
           System.out.println(results.get(i));    
         }
      }
    }
  • And last, but kind of first, here it is in LISP from Nate.
  • (defun run (&rest logs)
      (let ((counts (make-hash-table :test #'equal)))
        (dolist (filename logs)
          (with-open-file (stream filename :direction :input
                                  :external-format :latin-1)
            (loop for line = (read-line stream nil stream)
               until (eq line stream)
               do (cl-ppcre:register-groups-bind (match)
                      ("GET /ongoing/When/\\d{3}x/(\\d{4}/\\d{2}/\\d{2}/[^ .]+) " line)
                    (incf (gethash match counts 0))))))
        (loop for key being the hash-keys of counts
           collect key into keys
           finally (map nil #'(lambda (x)
                                (format t "~D: ~A~%" (gethash x counts) x))
                        (subseq (sort keys #'>
                                      :key #'(lambda (x) (gethash x counts))) 0 10)))))

A good way to understand other languages (programming or human) is to read the same story in each of these languages and compare them. Tim's problem serves that purpose well!

Oh, and if you want to see why we program in Managed Code, check out the C version.

Feel free to send me links to cool source that you find hasn't been given a good read.

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

facebook bluesky subscribe
About   Newsletter
Hosting By
Hosted on Linux using .NET in an Azure App Service

Where's my Orange Box Royalties?

October 22, 2007 Comment on this post [6] Posted in Musings
Sponsored By

The Half-Life 2 Orange Box is out for the Xbox360 and PC and my buddy Kevin at DiabeTech emailed me this:

"I was image surfing and found this pic. I honestly had to do a double-take as I thought it was you...so, I took a few minutes out and entertained myself by creating one with your likeness.  Just a little Monday morning humor ;)  I call it Hanselraker."

Thanks Kevin! Yikes, that does look like me pre-Laser Eye Surgery. That's not the best pic of me on the left there, but still, might be time to either lose the beard all together and go BabyFace or head the opposite direction and go GrizzlyAdams. Either way, Ryan Reynolds still has my career.

hanselraker

Where's my Half-Life 2 royalties? Fair use my *ss! ;)

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

facebook bluesky subscribe
About   Newsletter
Hosting By
Hosted on Linux using .NET in an Azure App Service

Web Common Sense isn't too Common

October 20, 2007 Comment on this post [26] Posted in Musings
Sponsored By

Ah, my poor family. The Web is too complex. I don't mean from a navigational sense, I mean from a navigating your way around it sense.

My mom sent me some emails today asking about scams. Here's one:

"I apologize for continuing to send this stuff, but I do not know what is real and what is junk…is this real?

REMINDER....all cell phone numbers are being released to telemarketing companies tomorrow and you will start to receive sale calls."

Visiting the FTC's website and confirming the SSL Certificate shows that this rumor about Cell Phones in the US and Telemarketers is not true.

But should my Mom have to worry about this kind of thing? Perhaps this is just life in a complex world?

My worry is, what if she asks me for her advice on one of these things and one day I'm wrong? I tell her to suspect everything. I also say that if someone called you at home and said the same thing they're saying in the email, would you believe them? Some how things in text seem more credible than things said out loud.

I know you're the IT Department for your family, Dear Reader. This is our charge and ours to bear happily, but are you also the Web Bullsh*t detector?

How do you protect your friends and family from these things? How do you teach Web Savvy?

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

facebook bluesky subscribe
About   Newsletter
Hosting By
Hosted on Linux using .NET in an Azure App Service

Hanselminutes Podcast 85 - EarthClassMail.com - Moving from LAMP to .NET 3.5

October 19, 2007 Comment on this post [0] Posted in ASP.NET | Microsoft | Podcast | Programming
Sponsored By

logo My eighty-fifth podcast is up. In this one I talk to Matt Davis, an Architect at EarthClassMail.com about their switch from a LAMP stack (Linux, apache, mysql, php/perl/python) to .NET 3.x. What's working, what's not, and what kinds of issues are they are running into as their architect their solution.

Subscribe: Subscribe to Hanselminutes Subscribe to my Podcast in iTunes

If you have trouble downloading, or your download is slow, do try the torrent with µtorrent or another BitTorrent Downloader.

Do also remember the complete archives are always up and they have PDF Transcripts, a little known feature that show up a few weeks after each show.

Telerik is our sponsor for this show.

Check out their UI Suite of controls for ASP.NET. It's very hardcore stuff. One of the things I appreciate about Telerik is their commitment to completeness. For example, they have a page about their Right-to-Left support while some vendors have zero support, or don't bother testing. They also are committed to XHTML compliance and publish their roadmap. It's nice when your controls vendor is very transparent.

As I've said before this show comes to you with the audio expertise and stewardship of Carl Franklin. The name comes from Travis Illig, but the goal of the show is simple. Avoid wasting the listener's time. (and make the commute less boring)

Enjoy. Who knows what'll happen in the next show?

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

facebook bluesky subscribe
About   Newsletter
Hosting By
Hosted on Linux using .NET in an Azure App Service

Hanselminutes spotted in the wild

October 18, 2007 Comment on this post [17] Posted in Podcast
Sponsored By

This was just too cool not to mention, so forgive me this indulgence. I was over at Moe's Restaurant for lunch a few days ago and this fine gentleman said "Hi Scott" and was, at that moment, listening to Hanselminutes. Amazing coincidence!

hanselminutesphoto2

How cool is that? Are you listening? I think I'll start giving out prizes if this happens again! I should have bought him lunch.

The first time this happened it was in the Chicago Airport bathroom. That was slightly considerably more disturbing, as I was "in the process" at the time. Anyway, so cool that folks enjoy the show.

Keep sending me show ideas!

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

facebook bluesky subscribe
About   Newsletter
Hosting By
Hosted on Linux using .NET in an Azure App Service

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.