Scott Hanselman

Converting from a String Representation of a Unicode Character back into a char

April 18, '05 Comments [4] Posted in Internationalization

Hopefully Michael Kaplan will step in here and explain some edge case or just a general comment like "that's totally wrong, Scott" - but until he does:

A fellow emailed me this question:

I want to convert a string representation of a Unicode character back into a 'char' in .NET C#.  Can you help?
 
i.e. "U+0041", which is hexadecimal for 65, which is ASCII for "A"
 
There's got to be a built-in function for this, but I just can't seem to find it.
 
To give you an idea, the pseudocode would be something like:
 
string s = "U+0041";
char c = new ?Unicode.Decoder.Decode?(s);
textBox1.Text = c.ToString();

Now, I have no idea why this gentleman would want to do this, but it's interesting enough. Here's what I came up with. I'm sure there's a better way.

//Just a reminder that you can use \u to escape Unicode in C#
char c = '\u0063';
Console.WriteLine(c.ToString());

//Here's how you'd go from a string to stuff like
// U+0053 U+0063 U+006f
string scott = "Scott and the letter c";
foreach(char s in scott)
{
	Console.Write("U+{0:x4} ",(int)s);
}
		
//Here's how I converted a string (assuming it starts with U+)
// containing the representation of a char
// back to a char
// Is there a built in, or cleaner way? Would this work in Chinese?
string maybeC = "U+0063";
int p = int.Parse(maybeC.Substring(2), System.Globalization.NumberStyles.HexNumber);
Console.WriteLine((char)p);
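
On the "cleaner way" and "would this work in Chinese?" questions: the cast to char works for any code point up to U+FFFF, which covers the commonly used Chinese characters, but a single char cannot hold a code point above that. Here is a minimal sketch, assuming .NET 2.0 or later, where char.ConvertFromUtf32 is available. The CodePointParser class and its Parse method are names made up for illustration, not anything built in.

```csharp
using System;
using System.Globalization;

public static class CodePointParser
{
    // Hypothetical helper: turns "U+0041" into "A".
    // Returns a string rather than a char because code points above U+FFFF
    // need two chars (a surrogate pair) in a .NET string.
    public static string Parse(string text)
    {
        if (text == null || !text.StartsWith("U+", StringComparison.OrdinalIgnoreCase))
            throw new FormatException("Expected something like U+0041.");

        int codePoint = int.Parse(text.Substring(2),
            NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture);

        // Throws ArgumentOutOfRangeException for surrogates and values above 0x10FFFF.
        return char.ConvertFromUtf32(codePoint);
    }

    public static void Main()
    {
        Console.WriteLine(Parse("U+0041"));          // A
        Console.WriteLine(Parse("U+4E2D").Length);   // 1 (a Chinese character in the BMP)
        Console.WriteLine(Parse("U+20000").Length);  // 2 (stored as a surrogate pair)
    }
}
```

Returning a string instead of a char is the design choice that makes the supplementary planes work; a caller that really needs a char can check that the result has Length 1 first.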


About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

Monday, April 18, 2005 2:11:55 AM UTC
Well, you do have to know what code page. To convert using the default code page, you can use

Encoding.Default.GetBytes(stUnicodeString)

to get back a byte array containing the non-Unicode character(s).
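
As a rough illustration of that call, here is a small sketch. What "default" means depends on the runtime: on .NET Framework it is the system ANSI code page, and on .NET Core and later it is UTF-8. The class and method names are invented for this example.

```csharp
using System;
using System.Text;

public static class DefaultEncodingDemo
{
    // Hypothetical helper: encodes text with the process default encoding.
    public static byte[] ToDefaultBytes(string text)
    {
        // .NET Framework: the system ANSI code page. .NET Core and later: UTF-8.
        return Encoding.Default.GetBytes(text);
    }

    public static void Main()
    {
        byte[] bytes = ToDefaultBytes("A");
        Console.WriteLine(bytes[0]);   // 65 under any ASCII-compatible default encoding
    }
}
```

Characters outside ASCII are where the choice of code page starts to change the bytes you get back.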
Monday, April 18, 2005 4:19:17 AM UTC
OK, I get it, the question is confusing. He's not asking "how do I convert a string representation", he's asking "how do I convert a BYTE representation". Hex is just a string-y way of representing bytes (along with base64, etc.).

So,

b = Text.Encoding.Unicode.GetBytes(s)
s = Text.Encoding.Unicode.GetString(b)

where s = string and b = array of bytes
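
To make that round trip concrete, here is a small sketch. One thing it shows is that these calls work on the characters of the string itself, so the text "U+0041" encodes as six ordinary characters rather than as the single character A. RoundTripDemo and ToHex are illustrative names only.

```csharp
using System;
using System.Text;

public static class RoundTripDemo
{
    // Hypothetical helper: UTF-16LE bytes of a string, shown as hex pairs.
    public static string ToHex(string s)
    {
        // Encoding.Unicode is UTF-16 little-endian: two bytes per char.
        return BitConverter.ToString(Encoding.Unicode.GetBytes(s));
    }

    public static void Main()
    {
        byte[] b = Encoding.Unicode.GetBytes("A");
        Console.WriteLine(ToHex("A"));                     // 41-00

        string s = Encoding.Unicode.GetString(b);
        Console.WriteLine(s);                              // A

        // The textual form "U+0041" is six chars, so twelve bytes.
        Console.WriteLine(Encoding.Unicode.GetBytes("U+0041").Length);  // 12
    }
}
```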
Monday, April 18, 2005 6:03:35 AM UTC
Depending on the requirement, you should also be aware of System.Globalization.StringInfo.
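
For instance, StringInfo can walk a string by text element (roughly what a reader would call a character) instead of by char, which matters once surrogate pairs or combining marks are involved. A minimal sketch follows; the TextElementDemo class and TextElements method are invented for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class TextElementDemo
{
    // Hypothetical helper: splits a string into its text elements.
    public static List<string> TextElements(string s)
    {
        var result = new List<string>();
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            result.Add(e.GetTextElement());
        return result;
    }

    public static void Main()
    {
        string s = "a\U00020000b";   // 'a', one supplementary character, 'b'
        Console.WriteLine(s.Length);                 // 4 chars
        Console.WriteLine(TextElements(s).Count);    // 3 text elements
    }
}
```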
Tuesday, April 19, 2005 11:46:34 PM UTC
Since Michael didn't say it: that works for a UCS-2 string, but not for a UTF-16 string that contains surrogate pairs. Granted, few strings have them, but isn't it more fun to make it completely right?

// Completely untested

// String to Unicode code points
string scott = "Scott and the letter c";
int highbits = 0;
foreach (char ch in scott)
{
    int i = (int) ch;
    if (i < 0xD800 || i > 0xDFFF)
        Console.Write("U+{0:x4} ", i);
    else if (i < 0xDC00) // ... Surrogate high
        highbits = i - 0xD800;
    else // ... Surrogate low
        // Parentheses matter: + binds tighter than << in C#
        Console.Write("U+{0:x6} ", (highbits << 10) + (i - 0xDC00) + 0x10000);
}

// Unicode code point to string
string codePoint = "U+12345";
int ordinal = int.Parse(codePoint.Substring(2), System.Globalization.NumberStyles.HexNumber);
if (ordinal < 0x10000)
    Console.WriteLine((char) ordinal);
else
{
    // Split into a surrogate pair; parentheses matter because + binds tighter than >> and &
    char high = (char) (((ordinal - 0x10000) >> 10) + 0xD800);
    char low = (char) (((ordinal - 0x10000) & 0x3FF) + 0xDC00);
    Console.WriteLine(new string(new char[] { high, low }));
}
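
On .NET 2.0 and later the framework can do the surrogate arithmetic itself, which sidesteps the operator-precedence traps of doing it by hand. Here is a hedged sketch of the string-to-code-points direction using char.ConvertToUtf32 and char.IsHighSurrogate. CodePointFormatter and ToCodePoints are names invented for this example.

```csharp
using System;
using System.Collections.Generic;

public static class CodePointFormatter
{
    // Hypothetical helper: lists the code points of a string as "U+XXXX" tokens.
    public static List<string> ToCodePoints(string s)
    {
        var result = new List<string>();
        for (int i = 0; i < s.Length; i++)
        {
            // Reads one code point, combining a surrogate pair if one starts at i.
            // Throws ArgumentException on an unpaired surrogate.
            int codePoint = char.ConvertToUtf32(s, i);
            if (char.IsHighSurrogate(s[i]))
                i++;    // skip the low half of the pair we just consumed
            result.Add("U+" + codePoint.ToString("X4"));
        }
        return result;
    }

    public static void Main()
    {
        Console.WriteLine(string.Join(" ", ToCodePoints("Sc\U00012345").ToArray()));
        // U+0053 U+0063 U+12345
    }
}
```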

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.