Scott Hanselman

How to convert a text file with a specific Codepage to UTF-8 using Visual Studio .NET

November 15, '06 Comments [6] Posted in Internationalization
Sponsored By

A partner recently sent me a RESX (.NET Resource) text file in Hebrew for a project I'm working on. When I opened it, it looked really bad, as seen in the screenshot below.

I assumed this was a classic ANSI vs. UTF-8 problem, so I tried to do a quick conversion within Notepad2, but that resulted in nonsense East Asian characters from all over:

This means that the original file was in fact in ASCII, but just not using an English codepage. I've talked a little about codepages in the Hanselminutes Podcast on Internationalization in the past and regularly point folks to the Joel on Software article on Internationalization.

In a nutshell, a codepage is a "view" that someone can use to look

Ordinarily, I recommend that everyone (including Israelis) use Unicode/UTF-8 to represent their Hebrew characters, however, a lot of folks who write Hebrew use the Windows 1255 codepage for their encoding. (BTW, go here, scroll down, and bask in the glory of this reference site.)

So, how do I convert from one codepage to another, or in my case, from Hebrew 1255 to UTF8?

We start with Visual Studio.NET.

  • From the File menu, select Open to bring up the dialog. Note that you'll have to select a file before the Open button is un-grayed.
  • Select the tiny down-arrow on the right side of the Open button and select "Open With..."
    • -5 points for obscure UI design here
  • Next, select "Source Code (Text) Editor with Encoding." The With Encoding part is crucial, otherwise the next dialog won't appear.
  • From the Encoding Dialog, select the source document's encoding - in my case, that's Hebrew 1255.
    • -10 points for obscure UI design here.
  • At this point the document should appear correctly in Visual Studio.NET. Now, reverse the process by selecting File|Save As and clicking the tiny down arrow on the Save button, and selecting both the desired Encoding and Line Endings.

Now you've successfully normalized your document to UTF-8, and it should be usable as a RESX file or whatever you like. Here's that same document loaded into Notepad2.

And it was Good.

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

facebook twitter subscribe
About   Newsletter
Sponsored By
Hosting By
Dedicated Windows Server Hosting by ORCS Web
Thursday, 16 November 2006 05:15:09 UTC
Interesting timing, I just wrote a VB6 application last week that will convert a file from a codepage to utf8. It's just too bad that there is no real good way to determine what codepage the original file uses.
Thursday, 16 November 2006 07:16:02 UTC
Or you could just swap Notepad2 for EmEditor, a much better editor in my opinion...
Daniel B
Thursday, 16 November 2006 08:38:04 UTC
Yeah - Visual Studio is the all-embracing tool for solving problems around the world since the early 90s ;-)

I'm working as software engineer in the localization business - you probably can imagine how often I have to mess around with silly codepage problems :-/ I think there is just one tool with similar conversion capabilities like VS, which furthermore is widely available: MS Word. Sounds strange, but seems to be so.
Thursday, 16 November 2006 13:37:55 UTC
nice to see some hebrew words in your blog
Thursday, 16 November 2006 18:21:12 UTC
I agree with Rene. In a previous life I was doing localization for Wizards of the Coast (Magic the Gathering) and we had grandiose plans of localizing ALL our content. Word does have the same capabilities, although IIRC it was nearly the same convoluted process as using VS.
Thursday, 16 November 2006 18:42:32 UTC
Obscure UI design, indeed! I think you're shortchanging them with only 5 points. The 'drop-down' open and save buttons are an abomination. Even when I know about them and have used them before, that bit of functionality never sticks in my head. I mean, who ever really *looks* at an 'Open' or 'Save' button anymore - they're as automatic as 'OK' (I wonder when I'll get the drop-down 'Not-really,-but-if-you-insist' option on my 'OK' buttons?)

I actually purchased UltraEdit because I had a need to deal with .rc files in several codepages, and it was the only editor that made dealing with multiple encodings and/or codepages possible for me to figure out.

Note that I'm not by any means saying that it's the only editor that can deal with them - it's just I couldn't figure out how to easily do my work with other editors. UE made it pretty easy, though you do have to remember to not only change the codepage, but also the font and the font's script - UE always reminds me of that. This is one of the the only pestering, constant reminder dialogs that I actually appreciate. 'Cause, Lord knows that when I go to do that type of work again 8 months from now, I sure as Hell won't remember.

Oh well, UE's pretty cheap, and it's a pretty good editor to have in my toolbox anyway.
Comments are closed.

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.