Scott Hanselman

The Importance of being UTF-8

September 29, 2006. Posted in ASP.NET | Internationalization

Kevin Hammond wanted the title of his blog to be "Casa dé Hambone" - note the é. He's running DasBlog and saw "Casa d Hambone" instead - note the missing é.

I knew/know that this works fine in DasBlog because it's been internationalized since Day 1 - we've got 14 languages out of the box. He sent me his site.config file (that's where DasBlog stores its configuration) and I opened it in Notepad2.

Notice in the screenshot that this file is saved as ANSI/ASCII. This file was probably manually edited with a non-clever editor.
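
To make the failure concrete, here is a minimal Python sketch (Python is just my choice for illustration; DasBlog itself is .NET). It assumes the "ANSI" code page in play was Windows-1252, which is a guess and not something the post confirms. It also shows two generic lenient decoding behaviors, not necessarily what DasBlog does internally.

```python
# The title the blog owner wanted; \u00e9 is the character é.
title = "Casa d\u00e9 Hambone"

# Assumption: the editor saved the file as Windows-1252 ("ANSI"),
# where é is the single byte 0xE9.
ansi_bytes = title.encode("cp1252")

# In UTF-8 the same character takes two bytes, 0xC3 0xA9.
utf8_bytes = title.encode("utf-8")

# Reading the ANSI bytes as if they were UTF-8: 0xE9 followed by a space
# is not a valid UTF-8 sequence, so a lenient decoder either substitutes
# U+FFFD or drops the byte entirely.
replaced = ansi_bytes.decode("utf-8", errors="replace")
ignored = ansi_bytes.decode("utf-8", errors="ignore")

print(ansi_bytes)      # b'Casa d\xe9 Hambone'
print(utf8_bytes)      # b'Casa d\xc3\xa9 Hambone'
print(repr(replaced))  # 'Casa d\ufffd Hambone'
print(repr(ignored))   # 'Casa d Hambone'
```

The last line matches the symptom Kevin saw, though which of the two behaviors a given reader applies depends on the library and its settings.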


However, if you do a straight convert, of course, you'll lose data (and Notepad2 warns you of this fact). Notice what happens when I do a convert via File|Encoding:


This is one situation where the Windows Clipboard works great and can save you a hassle. I selected all, copied to the clipboard, changed the encoding, then pasted.
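
If you'd rather not go through the clipboard, the same copy, switch, and paste sequence can be sketched in a few lines of Python. The helper name `reencode_to_utf8` and the `cp1252` default are my own assumptions for illustration; use whatever encoding the file was really saved in.

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding="cp1252"):
    """Rewrite the file at `path` as UTF-8, decoding it with `source_encoding` first.

    Hypothetical helper: the name and the cp1252 default are assumptions.
    """
    file_path = Path(path)
    # Decode with the encoding the file was really saved in. This is the
    # equivalent of copying the correctly displayed text to the clipboard.
    text = file_path.read_bytes().decode(source_encoding)
    # Encode the same characters as UTF-8. This is the equivalent of
    # pasting after switching the editor's encoding.
    file_path.write_bytes(text.encode("utf-8"))
    return text
```

Something like `reencode_to_utf8("site.config")` would convert the file in place, so keep a backup copy first. Working on raw bytes also leaves the line endings untouched.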


Now we're cool. ASP.NET and .NET in general will almost always "do the right thing" if you're using UTF-8. You can certainly specify alternate encodings if you like when you're opening a file via code. We use the StreamReader internally and the docs say:

StreamReader defaults to UTF-8 encoding unless specified otherwise, instead of defaulting to the ANSI code page for the current system. UTF-8 handles Unicode characters correctly and provides consistent results on localized versions of the operating system.
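
The same idea, naming the encoding explicitly instead of trusting a default, looks like this in a Python sketch. Note that Python's built-in `open` has not traditionally shared StreamReader's UTF-8 default: on many versions it falls back to a locale-dependent encoding when no `encoding` argument is given. The helper `read_text_with_fallback` and its `cp1252` fallback are hypothetical, chosen only for this example.

```python
def read_text_with_fallback(path, fallback_encoding="cp1252"):
    """Read `path` as strict UTF-8, falling back to `fallback_encoding`.

    Hypothetical helper; returns the decoded text and the encoding that worked.
    """
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        # Strict decoding raises on invalid bytes instead of silently
        # losing characters.
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Not valid UTF-8, so assume the legacy code page.
        return raw.decode(fallback_encoding), fallback_encoding
```

Trying strict UTF-8 first is a reasonable order because text in a legacy code page that contains accented characters is rarely valid UTF-8 by accident, while the reverse mistake produces garbage without raising any error.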

Joel's got a good article I've pointed to before about Internationalization. I've also got some posts in my Internationalization/i18n category.

Changing to UTF-8 fixed Kevin's problem.

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

September 29, 2006 23:59
Interesting. I wonder if my BatchEncoder tool would've done the right thing in this situation.
September 30, 2006 0:00
I've run into this problem trying to pass Unicode strings into one of the apps you pointed out, Console. I also ran into a similar situation with text editors in general when I had to use characters outside the normal ASCII range!
September 30, 2006 2:32
Humm... there is one place where dasBlog breaks: category and post names containing special characters (including + or an accented character). The link gets generated correctly, but dasBlog fails to find it (which is why the C++ category in my blog doesn't work :))
September 30, 2006 2:40
Hm...I thought we fixed that. I'll take a look.
September 30, 2006 6:43
I ran into this once with Voyager config files... in production. I was banging my head up against the wall for 15 minutes before I thought to check the encoding. Damn you and your ANSI default, notepad.exe! :)
September 30, 2006 8:51
Quick question. I opened a web.config file here with Notepad2, converted it to UTF-8. I then opened it with VS 2005, made a change and re-saved. I checked it again in Notepad2 and it was back to ANSI. Seems weird that VS 2005 does this. Any idea why?
September 30, 2006 8:56
John - When you only have ASCII text in a file, unless you've saved it with a Unicode BOM (Byte Order Mark), UTF-8 will be byte-for-byte identical to ASCII. It's only when you include a character in the high-Latin range or higher that it makes a difference. In Kevin's case, he stepped outside the basics. It's not that VS2005 is 'changing' the file to ASCII/ANSI; it has no way to glean that the file is UTF-8 because there are no characters in it that are distinguishable from ASCII, so for all intents and purposes, it is ASCII.
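
A quick way to check that claim yourself is the Python sketch below. The sample strings are made up for illustration, and `cp1252` stands in for "ANSI" as an assumption.

```python
import codecs

plain = "Casa de Hambone"          # ASCII characters only
accented = "Casa d\u00e9 Hambone"  # contains é (U+00E9)

# Without a BOM, ASCII-only text is the same bytes in ASCII, UTF-8,
# and Windows-1252, so an editor cannot tell them apart.
same_bytes = plain.encode("ascii") == plain.encode("utf-8") == plain.encode("cp1252")

# With a BOM, the UTF-8 output starts with the three bytes EF BB BF.
with_bom = plain.encode("utf-8-sig")
has_bom = with_bom.startswith(codecs.BOM_UTF8)

# Once a non-ASCII character appears, the encodings diverge.
differs = accented.encode("utf-8") != accented.encode("cp1252")

print(same_bytes, has_bom, differs)  # True True True
```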


Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.