Internationalized Regular Expressions

April 23, 2005 Comment on this post [10] Posted in ASP.NET | Javascript

April 23, 2005 10:58

I believe if you use the RegexOptions.ECMAScript option, you'll get the behavior you're looking for with \w. With that option set, \w is equivalent to the Unicode character classes [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].

haacked@gmail.com (Haacked)

April 23, 2005 11:18

Of course, but I didn't make myself clear: I'm talking about CLIENT-SIDE JAVASCRIPT.

I'll update the post.

Scott Hanselman

April 23, 2005 11:25

Haacked, you've got it reversed, BTW. ECMAScript is ignorant of Unicode, hence my client-side problem. Specifying RegexOptions.ECMAScript turns OFF functionality.

From MSDN:

"Character classes are specified differently in matching expressions. Canonical regular expressions support Unicode character categories by default. ECMAScript does not support Unicode.

\w: Matches any word character. Equivalent to the Unicode character categories [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \w is equivalent to [a-zA-Z_0-9]."

Scott Hanselman
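
Scott's point about the ASCII-only \w can be checked directly in client-side JavaScript. The following is a minimal sketch; the two sample names are made up for illustration and are not from the original post.

```javascript
// In JavaScript, \w covers only [A-Za-z0-9_], so accented letters fall outside it.
const wordOnly = /^\w+$/;

const asciiName = "Scott";        // hypothetical sample input
const accentedName = "Fréchette"; // hypothetical sample input

const asciiMatches = wordOnly.test(asciiName);       // true
const accentedMatches = wordOnly.test(accentedName); // false: "é" is not in [A-Za-z0-9_]

console.log(asciiMatches, accentedMatches);
```

This is why the expression in the post has to list the accented letters explicitly next to \w.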

April 23, 2005 13:00

What do you mean by "s-'"? You should put the "-" at the start of the character class; otherwise it denotes the range of characters between "s" and "'".
And what about &amp;? Do regular expressions know about HTML entities? I guess the correct expression would be

^[-ÀÈÌÒÙ àèìòù ÁÉÍÓÚ Ý áéíóúý
ÂÊÎÔÛ âêîôû ÃÑÕ ãñõ ÄËÏÖÜŸ
äëïöüŸ ¡¿çÇŒœ ßØøÅå ÆæÞþ
Ðð ""\w\d\s'.,&;#@:?!()$\/
]+$

wouldn't it?

Regards,
Ralf

Ralf Mueller
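
Ralf's advice about hyphen placement can be sketched as follows. The helper name `compiles` is hypothetical, and the tiny character classes are reduced stand-ins for the full expression, chosen only to isolate the hyphen.

```javascript
// Hypothetical helper: true if the pattern source compiles, false on a SyntaxError.
function compiles(source) {
  try {
    new RegExp(source);
    return true;
  } catch (e) {
    return false;
  }
}

// A hyphen between two literals forms a range. "s" is U+0073 and "'" is U+0027,
// so "s-'" is an out-of-order range and the engine rejects it.
const literalRangeCompiles = compiles("[s-']"); // false

// A hyphen at the start of the class is always a literal hyphen.
const leadingHyphen = /^[-s']+$/;
const acceptsHyphen = leadingHyphen.test("s-'"); // true
const acceptsOther = leadingHyphen.test("t");    // false

console.log(literalRangeCompiles, acceptsHyphen, acceptsOther);
```

Putting the hyphen first (or last) removes any doubt about whether it is read as a range operator.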

April 23, 2005 14:49

I don't know if client-side JavaScript will do what you want here. I did answer the generic question you asked about "what's the i18n 'right thing to do' when using regular expressions?" at http://blogs.msdn.com/michkap/archive/2005/04/23/411106.aspx but I doubt that will help with client-side stuff.

Michael Kaplan

April 23, 2005 18:31

"Well, I could do a RegEx that denies specific characters and allows all others [...]"

You should just do that...
BTW, which characters do you want to exclude?

Diego Mijelshon
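
The deny-list approach Diego suggests might look like the sketch below. The thread never says which characters should be excluded, so the set used here (angle brackets, curly braces, and ASCII control characters) and the function name `isAllowed` are assumptions made purely for illustration.

```javascript
// Hypothetical deny-list; the real set of excluded characters is not given in the thread.
const denied = /[<>{}\u0000-\u001F]/;

// A value passes when it is non-empty and contains none of the denied characters.
function isAllowed(value) {
  return value.length > 0 && !denied.test(value);
}

console.log(isAllowed("Daniel Fréchette")); // accented Latin letters pass
console.log(isAllowed("山田太郎"));          // so do non-Latin scripts
console.log(isAllowed("<script>"));         // angle brackets are rejected
```

Because nothing has to be enumerated per language, this version does not depend on \w knowing about Unicode.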

April 24, 2005 0:32

Ralf, what you thought was s- is actually \s (spaces) then a DASH. So I want \w \d \s, then the list of special chars shows up.

Scott Hanselman

April 24, 2005 23:09

Right :-) I didn't see the \ in front of the s...

I guess that's what makes those regular expressions hard to read :-)

Ralf Mueller

May 24, 2005 19:55

Writing code to support a few specific languages might be great for your app, but in general I would agree with the poster who recommended excluding a small number of characters and allowing everything else. This will do the right job most of the time for most applications. If you would rather write code that includes characters, you can include a large swath of Unicode by using something like [\u0080-\uFFFF]. JavaScript needs to implement the Unicode regular expression spec, but until then it is pretty close to impossible to do the right thing all the time for all users.

ed batutis
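
The include-a-range idea from ed's comment could be sketched like this, with the range written using four-digit \u escapes, which is the form JavaScript accepts. The ASCII part is a subset of the characters in Scott's expression, and the name `properName` and the sample strings are assumptions for illustration.

```javascript
// ASCII characters taken from Scott's expression, plus every UTF-16 code unit
// from U+0080 to U+FFFF. The hyphen comes first so it is read as a literal.
const properName = /^[-\w\s'.,&#@:?!()$\/\u0080-\uFFFF]+$/;

console.log(properName.test("Ærøskøbing"));    // letters above U+007F are accepted
console.log(properName.test("O'Brien-Smith")); // apostrophe and hyphen are accepted
console.log(properName.test("a<b"));           // "<" is ASCII and not listed, so it fails
```

The trade-off is that such a wide range also admits non-letter characters above U+007F, so it is closer to a deny-list over ASCII than to a true list of letters.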

July 18, 2005 23:08

Hi,

Can anyone refer me to a working example of Scott's regex? When I try using it in a script, it does not load in Firefox or IE. Here is my code:

// Is the field's value a proper name?
function proper() {
    if (field.type == "text" || field.type == "textarea") {
        // The hyphen is placed first in the character class so it is read as a literal, not as a range.
        var regx = /^[-ÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöü¡¿çÇßØøÅåÆæÞþÐð""\w\d\s'.,&#@:?!()$\/]+$/;
        if (field.value.length > 0 && !regx.test(field.value)) {
            alert('Not a proper name');
            return false;
        }
    }
    return true;
}

Note: The above function is part of a class and the 'field' variable is set when the class is instantiated.

Thank you,
Daniel

Daniel Fréchette

Comments are closed.