Scott Hanselman

Parsing CSVs and Poor Man's Web Log Analysis with PowerShell

October 31, '07 Comments [17] Posted in ASP.NET | PowerShell
Sponsored By

I was messing around with some log files for the podcast and exported a CSV. There's lots of ways for us to get stats, but I was just poking around.

Given a CSV format like this:

"File","Hits","Bandwidth"
"/hanselminutes_0026_lo.wma","78173","163625808"
"/hanselminutes_0076_robert_pickering.wma","24626","-1789110063"
"/hanselminutes_0077.wma","17204","1959963618"
"/hanselminutes_0076_robert_pickering.mp3","15796","-55874279"
"/hanselminutes_0078.wma","14832","-1241370004"
"/hanselminutes_0075.mp3","13685","-1840937989"
"/hanselminutes_0075.wma","12129","1276597408"
"/hanselminutes_0078.mp3","11058","-1186433073"

And noticing that there are different file extensions, but one logical show...for example there's an MP3 of Show 76 and WMA of that same show I wanted to find out how many downloads per show.

I fired up PowerShell. First I can bring in the CSV file and notice that it's parsed into a PowerShell object list nicely:

PS C:\> import-csv File.csv | Select File,Hits

File                                                        Hits
----                                                        ----
/hanselminutes_0026_lo.wma                                  78173
/hanselminutes_0076_robert_pickering.wma                    24626
/hanselminutes_0077.wma                                     17204
/hanselminutes_0076_robert_pickering.mp3                    15796
/hanselminutes_0078.wma                                     14832
/hanselminutes_0075.mp3                                     13685
/hanselminutes_0075.wma                                     12129
/hanselminutes_0078.mp3                                     11058

Notice that there are headings in this output? PowerShell's Import-CSV is smart enough to infer the headers and positions and split things up. After the first | pipe I don't ever have to think CSV again.

I'm going to need a new column, though, called Show with the Show number it in. I'll probably glean it from the Filename. Just to get started, I'll make a new "calculated column."

PS C:\> import-csv file.csv | select File, Hits, @{Name="Show";Expression={ $_.File } }

File                                    Hits    Show
----                                    ----    ----
/hanselminutes_0026_lo.wma              78173   /hanselminutes_0026_lo.wma
/hanselminutes_0076_robert_pickering... 24626   /hanselminutes_0076_robert_pickering...
/hanselminutes_0077.wma                 17204   /hanselminutes_0077.wma
/hanselminutes_0076_robert_pickering... 15796   /hanselminutes_0076_robert_pickering...
/hanselminutes_0078.wma                 14832   /hanselminutes_0078.wma
/hanselminutes_0075.mp3                 13685   /hanselminutes_0075.mp3
/hanselminutes_0075.wma                 12129   /hanselminutes_0075.wma
/hanselminutes_0078.mp3                 11058   /hanselminutes_0078.mp3
/hanselminutes_0077.mp3                 11003   /hanselminutes_0077.mp3
/hanselminutes_0074.mp3                 6494    /hanselminutes_0074.mp3

Except you can see that while I have a new column called "Show" it's just got the File duplicated. Notice the format for making a new Column:
select File, Hits, @{Name="Show";Expression={ $_.File } }.

Inside that { } scriptblock I need to do my work to get a show number.

It's a classic joke, so I'll say it again: "So you've got a problem and you've decided to use Regular Expressions to solve it. Now you've got two problems."

The show format has changed a bit over the years and some have leading zeros, some have names at the end. It's organic. So, in front of the expression we add a Regular Expression. You can make them easily in PowerShell by putting [regex] in front of a string. The ; is just a separator between statements. The first statement is just the assignment of a RegEx to a variable called $re. Later in my script block I apply that RegEx to the $_.File (remember that $_ is the equivalent of "this" in PowerShell, so $_.File is this.File).

The @{Name="Value";Name="Value"} format is how to express a Hashtable to PowerShell.

PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv | 
        select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } }

File                                    Hits     Show
----                                    ----     ----
/hanselminutes_0026_lo.wma              78173    26
/hanselminutes_0076_robert_pickering... 24626    76
/hanselminutes_0077.wma                 17204    77
/hanselminutes_0076_robert_pickering... 15796    76
/hanselminutes_0078.wma                 14832    78
/hanselminutes_0075.mp3                 13685    75
/hanselminutes_0075.wma                 12129    75
/hanselminutes_0078.mp3                 11058    78
/hanselminutes_0077.mp3                 11003    77
/hanselminutes_0074.mp3                 6494     74

The RegEx "\d{2}(?=[_.])" says "find me the first two decimals to the left of either a "_" or a "." It's not foolproof, at least not until show 100, but this is a Poor Man's Solution.

Next, I'll sort by Show descending then group all the like shows together.

PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
        select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } | 
        sort Show -desc | group Show

Count Name  Group
----- ----  -----
    9 79    {@{File=/Hanselminutes_0079.wma; Hits=27; Show=79},...
   12 78    {@{File=/hanselminutes_0078.m4b.torrent; Hits=18; Show=78},...
   12 77    {@{File=/hanselminutes_0077.wma.torrent; Hits=52; Show=77},...
   18 76    {@{File=/hanselminutes_76.mp3; Hits=1; Show=76}, ...
   11 75    {@{File=/hanselminutes_0075_lo.wma; Hits=468; Show=75},...

In this listing I can see via Count that there were 9 different formats of Show 79 downloaded in this time period. However, the good data is trapped in the Group column. All my previous line items are grouped in there. I need to sum up all the Hits for all the downloads of the the 9 different formats of Show 76...and all the shows.

PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
    select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } | 
    sort Show -desc | group Show | 
    select Name,{($_.Group | Measure-Object -Sum Hits).Sum }

Name   ($_.Group | Measure-Object -Sum Hits).Sum
----   ------------------------------------------
79      9205
78     27575
77     29807
76     42798
75     27174
74     13060
73     10532
72     10145
71      9826
70      8745
69      8065
68      8132
67      7024
66      8535
65     13041

Now, I select out Name and then take each item's Group (remember $_ is this, so $_.Group means "the current item's Group." Each item will go through Measure-Object which has options like -sum and -average, etc. We grab them up and Sum them.

But the Column Header looks silly. I want a pretty name.

PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
    select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } | 
    sort Show -desc | group Show |  
    select Name,
    @{Name="Hits";Expression={($_.Group | Measure-Object -Sum Hits).Sum }}

I just make a Hashtable with the Name "Hits" and put the scriptblock in a key called "Expression." Add some more sorting and:

Name   Hits
----   ----
26     78673
76     42798
77     29807
78     27575
75     27174
74     13060
65     13041
73     10532

Now I can see that show 76 got 42798 downloads this time period and the classic show 26 did nicely with 78673.


It this representative of the way you SHOULD do things in PowerShell? No, but it's a transcript of a real-life problem being together at the command line. It took less time than it would take to write a C# program and it's easily modified ad-hoc. Here's what Lee Holmes said he would have done, accomplishing the same thing not as a one-liner but as a proper PowerShell function:

function Get-ShowHits
{
  $regex = '/hanselminutes_(\d+).*'
  $shows = Import-CSv File.csv | Select File,Hits | Group { $_.File -replace $regex,'$1' }
  foreach($show in $shows) 
  {
    $showOutput = New-Object System.Management.Automation.PsObject
    $showOutput | Add-Member NoteProperty Name $show.Name
    $showOutput | Add-Member NoteProperty Hits ($show.Group | Measure-Object -Sum Hits).Sum
    $showOutput
  }
}
Get-ShowHits | Sort -Desc Hits

Both are great examples of why PowerShell rocks my world. When are you going to try it? I swear, everyone has "been meaning to try PowerShell but I haven't had the time." Just do it.

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. I am a failed stand-up comic, a cornrower, and a book author.

facebook twitter subscribe
About   Newsletter
Sponsored By
Hosting By
Dedicated Windows Server Hosting by ORCS Web
Wednesday, October 31, 2007 9:55:54 AM UTC
Scott,

I always use LogParser from Microsoft to mess with these types of files SQL like expressions but the ability to handle large files very quickly and parse multiple log types (including plain csv or xml files).

http://www.microsoft.com/downloads/details.aspx?FamilyID=890cd06b-abf8-4c25-91b2-f8d975cf8c07&displaylang=en

I have to admit that the last time I was doing this I couldn't help thinking "why am I not in Powershell", but LogParser has always proven to be a very fast parser for me on large log type files.

That said, it's good to have a few tools in your belt to be able to pull out and do stuff like this. It strikes me that if you make the effort to learn PowerShell then you are always carrying around a really nice swiss-army knife
Wednesday, October 31, 2007 12:46:04 PM UTC
Scott,

Just out of interest what is your 'IDE' for PowerShell - and what add-ons/extensions do you have installed etc ??

.. Ken
Wednesday, October 31, 2007 2:55:00 PM UTC
Martin actually beat me to it - Log Parser. I wouldn't even have thought to go the PowerShell route on this one.
Wednesday, October 31, 2007 3:16:01 PM UTC
While I have used Log Parser like Martin above describes, your example triggered a whole thought chain that led me to download Powershell and use it to solve a problem I had been having trying to quickly extract some information from a series of CSV files.

So in line with your final comment in your post, I just did it! But now I am trying for the 1000 time to truly grok regular expressions, sigh....

Cheers,

Robert Porter
Wednesday, October 31, 2007 4:38:38 PM UTC
Yes, I just did it too. Around the time when we have the first nerd dinner in Bellevue, I watched your dnrtv on PowerShell. That show really turned my light bulb on. Since then, I'm hooked to it. I even started to do a screencast on my blog to educate my Japanese audience. The more you dig into PowerShell, the deeper it gets. It's very fun. I'm starting to use PowerShell for my work as well. Script block is another cool thing you showed me here. I will be digging into it soon.

BTW, PowerShell team is doing usability test on the next version of PowerShell. Sure enough, they have a remoting capability... I wish I could get my hands on it.
Wednesday, October 31, 2007 6:15:23 PM UTC
I also use the Microsoft LogParser to analyze at least IIS logs when I need to. However, I went and added a little bit cleaner Front End Gui for the LogParser.dll to make things at least for me a little easier to use. It is written in VB.Net 2003, since that is what I have been playing around with lately. If you want to see what I did feel free to contact me and I will gladly send you the code.
Brandon
Wednesday, October 31, 2007 7:13:46 PM UTC
/n software has this custom Cmdlet to allow you to remotely manage windows. I thought I'd just share the link with y'all.
Wednesday, October 31, 2007 8:05:15 PM UTC
i don't have the IQ for PowerShell, sigh.
John A Davis
Wednesday, October 31, 2007 9:10:57 PM UTC
For those of you suffering through the screaming bizareness that is regular expressions, Check out a rather inexpensive program called Regex Buddy. The help file alone is worth the price of the tool.
Thursday, November 01, 2007 8:42:32 AM UTC
Here is a C# version using the OS FileHelpers library which didn't take very long and I think is easier to read.


Thursday, November 01, 2007 11:42:48 AM UTC
Start
Excel
File Open
Pivot Table

:-)
Mike Parsons
Thursday, November 01, 2007 4:13:35 PM UTC
Here is a C# version using the OS FileHelpers library which didn't take very long and I think is easier to read. Also a commenter added a python version as well!


Thursday, November 01, 2007 4:15:06 PM UTC
Sorry, about the double comment, feel free to whack it. Too many browser windows open!
Thursday, November 01, 2007 5:14:33 PM UTC
Ok internet, help me get over the hump and start using PowerShell.

I have a command-line program that will output text to the screen. I want to parse values from that output. This seems like one of the most basic operations that PowerShell would thrive at, but I'm having no luck.

To be more concrete, I'm trying to get a list of changesets for a directory in my VSTS source code repository, and then do something with each of them. I'm using the tfpt (TF Power Toys) history command:
tfpt history /recursive /stopafter:3 MyRootFolder
The output (from cmd.exe or powershell) is:
Changeset User Date Comment
------------
1234 someone 11/1/2007 something
1235 someone 11/1/2007 something
1242 someone 11/1/2007 something

I want to get 1234, 1235, 1242 in some kind of data structure that will allow me to loop over them and do something with each changeset.
I started down the road of using a regex: ^\d+
but I cannot figure out how to easily apply it to the output.

I know this has to be easy. Make me look stupid, please.
Thursday, November 01, 2007 10:44:10 PM UTC
Josh...I think it's:

tfpt history /recursive /stopafter:3 MyRootFolder | select Changeset | foreach-object { dosomething.exe $_ }

where $_ is the changesset number. You could also do:

tfpt history /recursive /stopafter:3 MyRootFolder | foreach-object { dosomething.exe $_.Changeset }


Saturday, November 03, 2007 8:41:33 PM UTC
Joshua, here's another approach. Keep in mind that PoSh has some fantastic capabilities when utilities output .NET objects into the pipeline. Unfortunately tfpt.exe isn't one of those so you have to drop back to text parsing. Fortunately PoSh has some pretty good tools for doing that.

17> $pattern = '^(?<changeset>\d+)\s+(?<user>\w+)\s+(?<date>[^ ]+)\s+(?<comment>.*)$'
18> $changesets = @()
19> tfpt history /recursive /stopafter:3 MyRootFolder | foreach {if ($_ -match $pattern) {$changesets += $matches}}
20> $changesets

Name Value
---- -----
date 11/1/2007
user someone
changeset 1234
comment something
0 1234 someone 11/1/2007 something
date 11/1/2007
user someone
changeset 1235
comment something
0 1235 someone 11/1/2007 something
date 11/1/2007
user someone
changeset 1242
comment something
0 1242 someone 11/1/2007 something

In this example I initialize an empty array called $changesets. Then I do a regex match against each line output by TFPT. Then ones that match, result in a global called $matches to be filled in with all the named captures. I just append this instance of $matches to the $changesets array. There are other ways to do this including creating a custom psobject and then dynamically adding properties to it. However the approach above is somewhat easier and suits the need I think.
BTW I pester the TFS team (Brian Harry, James Manning, etc) for PowerShell support (TFS cmdlets and providers) whenever I get the opportunity. I suggest that you do the same - if you get the chance. :-)
Thursday, November 29, 2007 6:22:54 PM UTC
> I swear, everyone has "been meaning to try PowerShell but I haven't had the time."

You're so right; I haven't had the time for about a year now!
Comments are closed.

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.