
Internationalized Regular Expressions

Posted 2005-04-22 11:09 PM in ASP.NET | Javascript.

UPDATE: There's more on Internationalized RegExs in this StackOverflow question.

I was trying to make a regular expression for use in client-side JavaScript (using a PeterBlum Validator) that allowed a series of special characters:

-'.,&#@:?!()$\/

Plus letters and numbers and whitespace:

\w\d\s

However, I mistakenly assumed that \w meant truly "word characters." It doesn't; in JavaScript it means [A-Za-z0-9_].

That sucks. What about José, when he wants to put his First Name into a form?
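
A quick check in a browser console or Node.js shows the gap. This is only a sketch, and the variable name `asciiWord` is made up for illustration.

```javascript
// \w without any flags matches only ASCII letters, digits, and underscore.
const asciiWord = /^\w+$/;

console.log(asciiWord.test("Jose"));    // true
console.log(asciiWord.test("José"));    // false, because é is outside [A-Za-z0-9_]
console.log(asciiWord.test("user_42")); // true, digits and underscore count as \w
```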

Well, I could do a RegEx that denies specific characters and allows all others, but I really just wanted to support Spanish, French, English, German, and any language that uses the general Latin Character Set.

So, here's what I have.

^[
  ÀÈÌÒÙ àèìòù ÁÉÍÓÚ Ý áéíóúý
  ÂÊÎÔÛ âêîôû ÃÑÕ ãñõ ÄËÏÖÜŸ
  äëïöüÿ ¡¿çÇŒœ ßØøÅå ÆæÞþ
  Ðð ""\w\d\s-'.,&#@:?!()$\/
]+$

Did I miss anything? (Ignore the whitespace for the purposes of this post's RegEx.)
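
One way to keep the readable groups and still end up with a whitespace-free pattern is to join them in code. This is a rough sketch rather than the exact validator setup: the names `LATIN_LETTERS`, `ALLOWED_TEXT`, and `isAllowedText` are invented, and it assumes lowercase ÿ and uppercase Ø belong in the set.

```javascript
// Hypothetical names; the groups mirror the rows in the post.
// Assumption: lowercase ÿ and uppercase Ø are intended members of the set.
const LATIN_LETTERS = [
  "ÀÈÌÒÙ", "àèìòù", "ÁÉÍÓÚÝ", "áéíóúý",
  "ÂÊÎÔÛ", "âêîôû", "ÃÑÕ", "ãñõ", "ÄËÏÖÜŸ",
  "äëïöüÿ", "¡¿", "çÇŒœ", "ßØøÅå", "ÆæÞþ",
  "Ðð",
].join("");

// \d is dropped because \w already covers digits; the dash goes last so it is
// always read as a literal dash rather than a range operator.
const ALLOWED_TEXT = new RegExp("^[" + LATIN_LETTERS + "\"\\w\\s'.,&#@:?!()$/-]+$");

function isAllowedText(value) {
  return ALLOWED_TEXT.test(value);
}

console.log(isAllowedText("José"));           // true
console.log(isAllowedText("Straße 5/7, #3")); // true
console.log(isAllowedText("<b>"));            // false, < and > are not in the list
```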

It's lame that \w doesn't work on the client-side based on your browser's locale. This makes it difficult for your RegExes to have parity between the client and server.


Friday, April 22, 2005 10:58:51 PM (Pacific Standard Time, UTC-08:00)
I believe if you use the RegexOptions.ECMAScript option, you'll get the behavior you're looking for with \w. With that option set, \w is equivalent to the Unicode character classes [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}].
Friday, April 22, 2005 11:18:39 PM (Pacific Standard Time, UTC-08:00)
Of course, but I didn't make myself clear, I'm talking about CLIENT-SIDE JAVASCRIPT.

I'll update the post.
Friday, April 22, 2005 11:25:08 PM (Pacific Standard Time, UTC-08:00)
Haacked, you've got it reversed, BTW. ECMAScript is ignorant of Unicode, hence my client-side problem. Specifying RegexOptions.ECMAScript turns OFF functionality.

From MSDN:

"Character classes are specified differently in matching expressions. Canonical regular expressions support Unicode character categories by default. ECMAScript does not support Unicode.

\w: Matches any word character. Equivalent to the Unicode character categories
[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]. If ECMAScript-compliant behavior is specified with the ECMAScript option, \w is equivalent to [a-zA-Z_0-9]."
Scott Hanselman
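
For comparison, JavaScript engines that implement ES2018 Unicode property escapes accept the same category list once the `u` flag is set. That feature did not exist in browsers when this thread was written, so the sketch below only applies to newer engines, and `unicodeWord` is a name made up for the example.

```javascript
// Same categories the MSDN text lists for \w in .NET.
// The u flag is required for \p{...} to be recognized.
const unicodeWord = /^[\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}\p{Pc}]+$/u;

console.log(unicodeWord.test("José"));     // true
console.log(unicodeWord.test("Søren_42")); // true
console.log(unicodeWord.test("a b"));      // false, a space is not a word character
```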
Saturday, April 23, 2005 1:00:47 AM (Pacific Standard Time, UTC-08:00)
What do you mean by "s-'"? You should put the "-" at the start of the expression; otherwise it means all characters between "s" and "'".
And what about &? Do regular expressions know about HTML entities? I guess the correct expression would be

^[-ÀÈÌÒÙ àèìòù ÁÉÍÓÚ Ý áéíóúý
ÂÊÎÔÛ âêîôû ÃÑÕ ãñõ ÄËÏÖÜŸ
äëïöüŸ ¡¿çÇŒœ ßØøÅå ÆæÞþ
Ðð ""\w\d\s'.,&;#@:?!()$\/
]+$

wouldn't it?

Regards,
Ralf
Saturday, April 23, 2005 2:49:00 AM (Pacific Standard Time, UTC-08:00)
I don't know if client-side JavaScript will do what you want here. I did answer the generic question you asked about "what's the i18n 'right thing to do' when using regular expressions?" at http://blogs.msdn.com/michkap/archive/2005/04/23/411106.aspx but I doubt that will help with client-side stuff.
Saturday, April 23, 2005 6:31:51 AM (Pacific Standard Time, UTC-08:00)
"Well, I could do a RegEx that denies specific characters and allows all others [...]"

You should just do that...
BTW, which characters do you want to exclude?
Diego Mijelshon
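
The deny-list idea could look roughly like the sketch below. The excluded characters are purely hypothetical, since the thread never settles on which ones to reject, and `ACCEPTABLE` and `isAcceptable` are invented names.

```javascript
// Hypothetical deny-list: reject angle brackets, braces, square brackets,
// pipe, backslash, caret, and tilde; accept everything else.
const ACCEPTABLE = /^[^<>{}\[\]|\\^~]+$/;

function isAcceptable(value) {
  return ACCEPTABLE.test(value);
}

console.log(isAcceptable("José Núñez")); // true
console.log(isAcceptable("名前"));        // true, anything not denied is allowed
console.log(isAcceptable("<b>"));        // false
```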
Saturday, April 23, 2005 12:32:29 PM (Pacific Standard Time, UTC-08:00)
Ralf, what you thought was s- is actually \s (whitespace) followed by a DASH. So I want \w, \d, and \s, and then the list of special chars shows up.
Sunday, April 24, 2005 11:09:57 AM (Pacific Standard Time, UTC-08:00)
Right :-) I didn't see the \ in front of the s...

I guess that's what makes those regular expressions hard to read :-)
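
The dash confusion is easy to check. In a JavaScript character class without the `u` flag, a dash next to a class escape such as \s is taken literally, while with the `u` flag the same pattern is rejected, so putting the dash first or last is the safer habit. A sketch, with made-up names:

```javascript
// Without the u flag, \s-' means: whitespace, a literal dash, an apostrophe.
const ambiguous = /^[\s-']+$/;
console.log(ambiguous.test("-")); // true, the dash itself is accepted
console.log(ambiguous.test("%")); // false, no range from space to apostrophe was created

// Hypothetical helper that reports whether a pattern compiles with the u flag.
function compilesWithUnicodeFlag(source) {
  try {
    new RegExp(source, "u");
    return true;
  } catch (err) {
    return false;
  }
}

console.log(compilesWithUnicodeFlag("[\\s-']")); // false, the dash after \s is a syntax error
console.log(compilesWithUnicodeFlag("[-\\s']")); // true, a leading dash is unambiguous
```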
Tuesday, May 24, 2005 7:55:52 AM (Pacific Standard Time, UTC-08:00)
Writing code to support a few specific languages might be great for your app, but in general I would agree with the poster who recommended excluding a small number of characters and allowing everything else. This will do the right job most of the time for most applications. If you would rather write code that includes characters, you can include a large swath of Unicode by using something like [\u0080-\uFFFF]. JavaScript needs to implement the Unicode regular expression spec, but until then it is pretty close to impossible to do the right thing all the time for all users.
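
The broad-range idea might be sketched as follows. JavaScript's \u escape takes exactly four hex digits, so the range is written \u0080-\uFFFF here; characters outside the Basic Multilingual Plane still pass because each is stored as two surrogate code units that fall inside that range. The ASCII part of the class and the names `BROAD` and `isBroadlyAllowed` are assumptions for illustration.

```javascript
// ASCII word characters, whitespace, the post's punctuation,
// then every code unit from U+0080 up. The trailing dash is a literal dash.
const BROAD = /^[\w\s"'.,&#@:?!()$\/\u0080-\uFFFF-]+$/;

function isBroadlyAllowed(value) {
  return BROAD.test(value);
}

console.log(isBroadlyAllowed("José"));  // true
console.log(isBroadlyAllowed("名前"));   // true
console.log(isBroadlyAllowed("<b>"));   // false, ASCII characters outside the list are still rejected
```

Note that a range this wide also admits symbols and other non-letters above U+007F, which is the trade-off of the approach.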

ed batutis
Monday, July 18, 2005 11:08:55 AM (Pacific Standard Time, UTC-08:00)
Hi,

Can anyone refer me to a working example of Scott's regex? When I try using it in a script, it does not load in Firefox or IE. Here is my code:

// Is a proper name?
function proper() {
  if (field.type == "text" || field.type == "textarea") {
    // Accented Latin letters, word characters, whitespace, and common punctuation.
    var regx = /^[ÀÈÌÒÙàèìòùÁÉÍÓÚÝáéíóúýÂÊÎÔÛâêîôûÃÑÕãñõÄËÏÖÜäëïöü¡¿çÇßØøÅåÆæÞþÐð""\w\d\s-'.,&#@:?!()$\/]+$/;
    // Empty values pass; non-empty values must match the whole pattern.
    if (field.value.length > 0 && !regx.test(field.value)) {
      alert('Not a proper name');
      return false;
    }
  }
  return true;
};

Note: The above function is part of a class and the 'field' variable is set when the class is instantiated.

Thank you,
Daniel
Daniel Fréchette
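
The comment does not say why the script fails to load. One possible cause, and it is only an assumption, is an encoding mismatch between how the .js file was saved and how the browser reads it, which would mangle the accented literals. Writing them as \uXXXX escapes keeps the source pure ASCII. The sketch below uses ranges that cover the Latin-1 letters (a slight superset of the list in the comment) and takes the value as a parameter; `PROPER_NAME` and `isProperName` are invented names.

```javascript
// Latin-1 letters as escaped ranges: U+00C0-U+00D6, U+00D8-U+00F6, U+00F8-U+00FF
// (skipping the multiplication and division signs), plus inverted ! and ?.
// The trailing dash is a literal dash.
const PROPER_NAME = /^[\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u00A1\u00BF"\w\s'.,&#@:?!()$\/-]+$/;

// Mirrors the original logic: empty values pass, anything else must match.
function isProperName(value) {
  return value.length === 0 || PROPER_NAME.test(value);
}

console.log(isProperName("Daniel Fr\u00E9chette")); // true
console.log(isProperName(""));                      // true
console.log(isProperName("<script>"));              // false
```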