Scott Hanselman

Internationalized Regular Expressions

April 23, '05 Comments [10] Posted in ASP.NET | Javascript
UPDATE: There's more on Internationalized RegExs in this StackOverflow question.

I was trying to make a regular expression for use in client-side JavaScript (using a PeterBlum Validator) that allowed a series of special characters:

-'.,&#@:?!()$\/

Plus letters and numbers and whitespace:

\w\d\s

However, I mistakenly assumed that \w meant truly "word characters." It doesn't. In JavaScript it means [A-Za-z0-9_].

That sucks. What about José, when he wants to put his First Name into a form?
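Here's a quick sketch you can paste into a browser console to see the problem. The sample names are just illustrations, and the é is written as an escape so the snippet doesn't depend on file encoding.

```javascript
// \w in JavaScript is ASCII-only: it is the same as [A-Za-z0-9_].
const wordOnly = /^\w+$/;

const plainName = wordOnly.test("Jose");          // unaccented letters only
const accentedName = wordOnly.test("Jos\u00E9");  // "José", with é as U+00E9

console.log(plainName);    // true
console.log(accentedName); // false
```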

Well, I could do a RegEx that denies specific characters and allows all others, but I really just wanted to support Spanish, French, English, German, and any language that uses the general Latin Character Set.

So, here's what I have.

^[
  ÀÈÌÒÙ àèìòù ÁÉÍÓÚ Ý áéíóúý
  ÂÊÎÔÛ âêîôû ÃÑÕ ãñõ ÄËÏÖÜŸ
  äëïöüÿ ¡¿çÇŒœ ßØøÅå ÆæÞþ
  Ðð ""\w\d\s-'.,&#@:?!()$\/
]+$

Did I miss anything? (Ignore the whitespace in the RegEx above; it's only there to make this post readable.)
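To actually run it, the layout whitespace has to go. Here's a rough sketch of one way to do that in JavaScript: keep the class as readable chunks and join them with no separator. The names `allowedChunks` and `isAllowed` are made up for this example, including both cases of y-diaeresis is an assumption, and the dash is moved to the end of the class so it can never be read as a range.

```javascript
// Readable chunks; joined with no separator, so layout whitespace
// never becomes part of the pattern.
const allowedChunks = [
  "ÀÈÌÒÙàèìòù", "ÁÉÍÓÚÝáéíóúý",
  "ÂÊÎÔÛâêîôû", "ÃÑÕãñõ", "ÄËÏÖÜäëïöü",
  "Ÿÿ",                         // both cases of y-diaeresis (an assumption)
  "¡¿çÇŒœ", "ßØøÅå", "ÆæÞþÐð",
  String.raw`"\w\d\s'.,&#@:?!()$\/`,
  "-",                          // dash last, so it is always a literal
];
const allowed = new RegExp("^[" + allowedChunks.join("") + "]+$");

function isAllowed(text) {
  return allowed.test(text);
}

console.log(isAllowed("Jos\u00E9 N\u00FA\u00F1ez")); // true
console.log(isAllowed("Zo\u00EB <script>"));         // false: < and > are not in the class
```

Save the file as UTF-8 (and serve it that way), or the accented literals may not survive the trip to the browser.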

It's lame that \w doesn't work on the client-side based on your browser's locale. This makes it difficult for your RegExes to have parity between the client and server.

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

Saturday, 23 April 2005 06:58:51 UTC
I believe if you use the RegexOptions.ECMAScript option, you'll get the behavior you're looking for with \w. With that option set, \w is equivalent to the Unicode character classes [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
Saturday, 23 April 2005 07:18:39 UTC
Of course, but I didn't make myself clear: I'm talking about CLIENT-SIDE JAVASCRIPT.

I'll update the post.
Saturday, 23 April 2005 07:25:08 UTC
Haacked, you've got it reversed, BTW. ECMAScript is ignorant of Unicode, hence my client-side problem. Specifying RegexOptions.ECMAScript turns OFF functionality.

From MSDN:

"Character classes are specified differently in matching expressions. Canonical regular expressions support Unicode character categories by default. ECMAScript does not support Unicode.

\w: Matches any word character. Equivalent to the Unicode character categories
[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \w is equivalent to [a-zA-Z_0-9]."
Scott Hanselman
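For what it's worth, newer JavaScript engines narrow this gap: ECMAScript 2018 added Unicode property escapes, which work only when the regex carries the `u` flag. Here is a hedged sketch of a class that mirrors the categories in the MSDN quote. `dotNetStyleWord` is a name made up for this example, and older browsers will reject the syntax outright.

```javascript
// Requires the `u` flag and an engine that supports ES2018 property escapes.
const dotNetStyleWord = /^[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]+$/u;
const ecmaWord = /^\w+$/; // still [A-Za-z0-9_]

const unicodeResult = dotNetStyleWord.test("Jos\u00E9"); // "José"
const ecmaResult = ecmaWord.test("Jos\u00E9");

console.log(unicodeResult); // true
console.log(ecmaResult);    // false
```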
Saturday, 23 April 2005 09:00:47 UTC
what do you mean by "s-'"? You should put the "-" at the start of the expression... otherwise it means all characters in between "s" and "'"...
And what about &? Do regular expressions know about HTML-Entities? I guess the correct Expression would be

^[-ÀÈÌÒÙ àèìòù ÁÉÍÓÚ Ý áéíóúý
ÂÊÎÔÛ âêîôû ÃÑÕ ãñõ ÄËÏÖÜŸ
äëïöüŸ ¡¿çÇŒœ ßØøÅå ÆæÞþ
Ðð ""\w\d\s'.,&;#@:?!()$\/
]+$

wouldn't it?

Regards,
Ralf
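For reference, a small sketch of how JavaScript treats a dash inside a character class. It assumes regexes without the `u` flag, which is what browsers of this era support; the variable names are only illustrative.

```javascript
// Dash between two characters: a range (a, b, c).
const range = /^[a-c]+$/;
// Dash first: a literal dash, plus a and c.
const dashFirst = /^[-ac]+$/;
// Dash right after a class escape such as \s: treated as a literal dash.
const afterEscape = /^[\s-']+$/;

console.log(range.test("b"), range.test("-"));         // true false
console.log(dashFirst.test("-"), dashFirst.test("b")); // true false
console.log(afterEscape.test(" -'"));                  // true
```

Putting the dash first or last in the class is the safer habit either way, since it stays a literal no matter what sits next to it.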
Saturday, 23 April 2005 10:49:00 UTC
I don't know if client-side JavaScript will do what you want here. I did answer the generic question you asked about "what's the i18n 'right thing to do' when using regular expressions?" at http://blogs.msdn.com/michkap/archive/2005/04/23/411106.aspx but I doubt that will help with client-side stuff.
Saturday, 23 April 2005 14:31:51 UTC
"Well, I could do a RegEx that denies specific characters and allows all others [...]"

You should just do that...
BTW, which characters do you want to exclude?
Diego Mijelshon
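A minimal sketch of that deny-list approach, for comparison. The excluded characters here are purely hypothetical, since the thread never says which ones should be rejected, and `isAcceptable` is a made-up name.

```javascript
// Hypothetical deny-list: these characters are illustrative, not from the post.
const denied = /[<>{}\[\]|^~`]/;

function isAcceptable(text) {
  // Anything goes unless it contains one of the denied characters.
  return !denied.test(text);
}

console.log(isAcceptable("Jos\u00E9")); // true
console.log(isAcceptable("a<b"));       // false
```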
Saturday, 23 April 2005 20:32:29 UTC
Ralf, what you thought was s- is actually \s (whitespace) and then a DASH. So I want \w \d \s, and then the list of special chars shows up.
Sunday, 24 April 2005 19:09:57 UTC
right :-) didn't see the \ in front of the s...

I guess that's what makes those regular expressions hard to read :-)
Tuesday, 24 May 2005 15:55:52 UTC
Writing code to support a few specific languages might be great for your app, but in general I would agree with the poster who recommended excluding a small number of characters and allowing everything else. This will do the right job most of the time for most applications. If you would rather write code that includes characters, you can include a large swath of Unicode by using something like \u0080-\uFFFF. JavaScript needs to implement the Unicode regular expression spec, but until then it is pretty close to impossible to do the right thing all the time for all users.

ed batutis
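As a rough sketch of that include-a-wide-range idea: the class below accepts printable ASCII plus everything from U+0080 to U+FFFF. The name `broad` and the decision to limit the ASCII part to U+0020 through U+007E are assumptions made for this example.

```javascript
// Printable ASCII (U+0020..U+007E) plus everything from U+0080 to U+FFFF.
const broad = /^[\u0020-\u007E\u0080-\uFFFF]+$/;

console.log(broad.test("Jos\u00E9"));          // true
console.log(broad.test("\u65E5\u672C\u8A9E")); // true (three CJK characters)
console.log(broad.test("tab\there"));          // false: the tab is U+0009
```

Note that a range this wide also lets through the C1 control characters (U+0080 to U+009F), so it trades precision for coverage.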
Monday, 18 July 2005 19:08:55 UTC
Hi,

Can anyone refer me to a working example of Scott's regex? When I try using it in a script, it does not load in Firefox or IE. Here is my code:

// Is a proper name?
function proper() {
  if (field.type == "text" || field.type == "textarea") {
    var regx = /^[ÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöü¡¿çÇßØøÅåÆæÞþÐð""\w\d\s-'.,&#@:?!()$\/]+$/;
    if (field.value.length > 0 && !regx.test(field.value)) {
      alert('Not a proper name');
      return false;
    }
  }
  return true;
};

Note: The above function is part of a class and the 'field' variable is set when the class is instantiated.

Thank you,
Daniel
Daniel Fréchette
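One thing worth ruling out, purely as a guess since the thread never says what went wrong: if the script file is saved or served in an encoding the browser doesn't expect, the accented literals can get mangled before the regex is ever parsed. Here is a sketch that sidesteps that by writing the non-ASCII characters as \u escapes, so the source stays plain ASCII. The name `isProperName` and the choice of ranges are assumptions for illustration, not a drop-in replacement for the class method above.

```javascript
// Latin-1 letters U+00C0..U+00FF, skipping the multiply (U+00D7) and
// divide (U+00F7) signs, plus the OE ligatures (U+0152, U+0153),
// capital Y-diaeresis (U+0178), and the inverted marks (U+00A1, U+00BF).
// The dash sits last in the class so it is always a literal.
var properName = /^[\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0152\u0153\u0178\u00A1\u00BF"\w\s'.,&#@:?!()$\/-]+$/;

function isProperName(value) {
  // Empty values pass, matching the length check in the function above.
  return value.length === 0 || properName.test(value);
}

console.log(isProperName("Daniel Fr\u00E9chette")); // true
console.log(isProperName("a<b"));                   // false
```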

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.