Scott Hanselman

Unix Fight! - Sed, Grep, Awk, Cut and Pulling Groups out of a PowerShell Regular Expression Capture

August 1, '11 Comments [15] Posted in PowerShell

There's a wonderful old programmer's joke I've told for years:

"You've got a problem, and you've decided to use regular expressions to solve it.

Ok, now you've got two problems..."

A friend of mine was talking on a social network and said something like:

"That decade I spent in the Windows world stunted my growth. one teeny-tiny unix command grabbed certain values from an XML doc for me."

Now, of course, I took this immediately as a personal challenge and rose up in a rit of fealous jage and defended my employer. Nah, not really, as I worked at Nike on Unix for a number of years and I get the power of sed and awk and whatnot. However, he said XML, and well, PowerShell rocks XML.

Because it's a dynamic language, you can refer to XML nodes just like this:

$a = ([xml](new-object net.webclient).downloadstring("http://feeds.feedburner.com/Hanselminutes"))
$a.rss.channel.item

The first line gets the feed and the second line gets all the items.

However, it turns out my friend was actually trying to retrieve values within poorly-formed XML fragments within a larger SQL dump file. There are three kinds of XML: well-formed, valid, and crap. He was sifting through crap for some values. Basically he had this crazy text file with some fragments of XML within it and wanted the values in between elements: "<FancyPants>He wants this value</FancyPants>."

Something like this:

grep "<FancyPants>.*<.FancyPants>" test.txt | sed -e "s/^.*<FancyPants/<FancyPants/" | cut -f2 -d">"| cut -f1 -d"<" > fancyresults.txt

I'm old, but I'm not an expert in grep and sed, so I'm sure there are ways he could have done it more tersely. There always are, right? With regular expressions, sometimes someone just types $@($*@)$(*@)(@*)@*(%@%# and Shakespeare pops out. You never know.
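
As one possible terser take on the Unix side, here is a minimal sketch that folds the grep, the sed and both cuts into a single sed call. The sample lines are made up to stand in for the SQL dump, and the names tmp and result are just placeholders; like the original pipeline, it assumes one fragment per line.

```shell
# Hypothetical sample input standing in for the SQL dump file
tmp=$(mktemp)
printf '%s\n' \
  'INSERT INTO t VALUES (1, "<FancyPants>He wants this value</FancyPants>");' \
  'a line with no fragment at all' \
  'junk <FancyPants>and this one</FancyPants> more junk' > "$tmp"

# -n turns off sed's default printing; the trailing p prints only the lines
# where the substitution succeeded, so non-matching lines are dropped.
# \(.*\) captures the text between the tags and \1 replaces the whole line with it.
result=$(sed -n 's/.*<FancyPants>\(.*\)<\/FancyPants>.*/\1/p' "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

With the sample input above this prints "He wants this value" and "and this one", one per line.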

There are also a lot of different ways to do this in PowerShell, but since he used RegExes, who am I to disagree?

First, here's the one line answer.

cat test.txt | foreach-object {$null = $_ -match '<FancyPants>(?<x>.*)<.FancyPants>'; $matches.x}

But I thought I'd also sort them, remove duplicates...

cat test.txt | foreach-object {$null = $_ -match '<FancyPants>(?<x>.*)<.FancyPants>'; $matches.x} | sort | get-unique

But foreach-object can be aliased as % and get-unique can be just "gu" so the final answer is:

cat test.txt | % {$null = $_ -match '<FancyPants>(?<x>.*)<.FancyPants>';$matches.x} | sort | gu

I think we can agree that they are both hard to read. I still love PowerShell.
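
For comparison, the sort-and-remove-duplicates step looks roughly like this on the Unix side. This is only a sketch: the three sample lines are invented, tmp and result are placeholder names, and it assumes one fragment per line.

```shell
# Hypothetical sample input with a duplicate value
tmp=$(mktemp)
printf '%s\n' \
  '<FancyPants>zebra</FancyPants>' \
  '<FancyPants>apple</FancyPants>' \
  '<FancyPants>zebra</FancyPants>' > "$tmp"

# Extract the value between the tags, then let sort -u both order the
# values and drop the repeats (the job of sort | get-unique in PowerShell).
result=$(sed -n 's/.*<FancyPants>\(.*\)<\/FancyPants>.*/\1/p' "$tmp" | sort -u)

printf '%s\n' "$result"
rm -f "$tmp"
```

With this input the output is "apple" followed by "zebra".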

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. I am a failed stand-up comic, a cornrower, and a book author.

Monday, August 01, 2011 9:47:21 PM UTC
cat itself is an alias for get-content with the standard alias of "gc" - you can shave one more character :)
Monday, August 01, 2011 9:47:28 PM UTC
don't forget select-string

gci . *.csproj -rec | select-string "<HintPath>(.*)</HintPath>" | % { $_.Matches } | % { $_.Groups[1].Value }
fschwiet
Monday, August 01, 2011 9:47:35 PM UTC
In a similar vein, if you've never checked out the Command Line Kung Fu blog, I highly recommend it.

Someone posts a question asking how to do something on the command line, and the authors try to do it using Unix commands, the Windows command line, and Powershell. Some of them are really neat.
Monday, August 01, 2011 10:08:44 PM UTC
Please note that the regex you posted is not for Shakespeare, but for Samuel Taylor; the regex for Shakespeare has an additional pair of parentheses and a question mark.
Monday, August 01, 2011 10:55:55 PM UTC
There are a few things you can use to simplify this:

1. Sort-Object has a -Unique parameter, so you don't actually need Get-Unique there.
2. You can save the ForEach-Object in there by just running replace over every line that matches the pattern:


(gc test.txt) -match '<(FancyPants)>.*</\1>' -replace '.*<FancyPants>|</FancyPants>.*' | sort -u


Ok, it's not much simpler, since you need to filter the lines that match first (akin to the grep and sed from before). Oh, well.
Johannes
Monday, August 01, 2011 11:06:43 PM UTC
One comment that this crowd has not mentioned...maybe because it is not fact, rather opinion and this crowd is smart enough to only speak of facts. But...IMO the PowerShell version is much easier to read therefore (theoretically) easier to maintain.
Tuesday, August 02, 2011 12:53:03 AM UTC
It's a great quote - I use it myself - but credit for it goes to Jamie Zawinski: http://regex.info/blog/2006-09-15/247
JohnW
Tuesday, August 02, 2011 3:57:39 AM UTC
A few other quotes from jwz (outside of jwz.org and @jwz on twitter, of course) can be found @ http://en.wikiquote.org/wiki/Jamie_Zawinski

Tuesday, August 02, 2011 10:59:44 AM UTC
The original example will need to be de-duplicated because $matches is not reset on the lines where -match finds nothing, so after a match is found, every line will return something. It's better to have
  % {If ($_ -match '<FancyPants>(?<x>.*)<.FancyPants>') { $matches.x}}


Instead of using a -match and a -split or a cat / gc and a -match you can use select-string

select-string "(?<=(FancyPants)>).*(?=</\1)" '.\test.txt' | %{$_.matches[0].value} | sort -u


Though if you'll be maintaining it
select-string -pattern "(?<=(FancyPants)>).*(?=</\1)" -path '.\test.txt' | foreach-object {$_.matches[0].value} | sort-object -unique


Tuesday, August 02, 2011 11:19:34 AM UTC
Using gawk gives you access to the regex groups on Linux:

gawk 'match($0, /.*<FancyPants>(.*)<\/FancyPants>.*/, a) {print a[1]}' test.txt
Tuesday, August 02, 2011 12:50:01 PM UTC
Select-String can also be abbreviated to "ss".

At work I use it all the time to search through the code:
dir -Recurse -Filter *.cs | ss "something"
Andreas
Tuesday, August 02, 2011 2:35:11 PM UTC
This doesn't work if the desired xml node appears more than once on the same line.

If the content of file is the following (all in one line):
<FancyPants>He wants this value</FancyPants><FancyPants>and this value</FancyPants>

the result of the command would be:
He wants this value</FancyPants><FancyPants>and this value
Buma
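
One way to sidestep the several-fragments-per-line problem on the Unix side is sketched below. It assumes the wanted values never contain a "<" character, and that grep supports -o (GNU and BSD grep do; it is not in POSIX). The names line and result are placeholders.

```shell
# The single-line input from the comment above
line='<FancyPants>He wants this value</FancyPants><FancyPants>and this value</FancyPants>'

# [^<]* cannot run past the next tag, so each element matches separately;
# grep -o prints every match on its own line, and sed then strips the tags.
result=$(printf '%s\n' "$line" |
  grep -o '<FancyPants>[^<]*</FancyPants>' |
  sed -e 's/<[^>]*>//g')

printf '%s\n' "$result"
```

This prints "He wants this value" and "and this value" on separate lines instead of one greedy match spanning both elements.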
Tuesday, August 02, 2011 9:01:19 PM UTC
@Buma, true. In that case you'd want something like

$content = [io.file]::ReadAllText('c:\test.txt'); $content -match '(?is)...'; ...

to kick the regex into Singleline mode. It's criminal IMO that Get-Content doesn't have a parameter to force reading the file as a single string. It would be very useful for situations like this where a pattern to be matched can span multiple lines.

Also note that file paths do *not* need to be quoted in PowerShell unless there's a space or a special PowerShell character you need to escape.

Finally, I would use select-string for this but the result is longer:

select-string '<FancyPants>(?<x>.*)<.FancyPants>' test.txt | %{$_.matches} | %{$_.Groups[1].value} | sort -u
Wednesday, August 03, 2011 6:50:48 PM UTC
Good grief man, when do you ever sleep?

Thanks, as always, for another superb article that schooled me on techniques I will now use and claim as my own.
Sunday, August 28, 2011 11:27:41 AM UTC
Great article, you have indeed covered topic in detail with samples. I have also blogged my experience as 10 examples of grep command in unix ,let me know how do you find it. Thanks

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.