Scott Hanselman

CSI: Visual Studio - Unable to translate Unicode character at index X to specified code page

June 08, 2013 Posted in Bugs
A crazy internal error from Visual Studio

A customer emailed me a weird one. I tend to have a sense for when something is up and when an obscure thing will turn into something interesting.

The person says:

...mysteriously most of my projects refuse to build.  "The build stopped unexpectedly because of an internal failure... something about unicode... blah blah"

There are a few messages out there on the web about it -- even a really old hot fix.  What's the best way to proceed with the VS team / MS?  Is there anyone actively interested in glitches like this?

My spidey-sense is tingling. First, when something says "internal failure" it means some fundamental expectation wasn't met. Garbage in perhaps? He says "most of my projects" which implies it's not a specific project. There's also the sense that this is a "suddenly things stopped working" type thing. Presumably it worked before.

I say:

"Have you checked all the source files to make sure one isn't filled with Unicode nulls or something?"

He says no, but sends a call stack (which is always nicer when it's sent FIRST, but still):

Error    1    The build stopped unexpectedly because of an internal failure.
System.Text.EncoderFallbackException: Unable to translate Unicode character \uD97C at index 1321 to specified code page.
at System.Text.EncoderExceptionFallbackBuffer.Fallback(Char charUnknown, Int32 index)
at System.Text.EncoderFallbackBuffer.InternalFallback(Char ch, Char*& chars)
at System.Text.UTF8Encoding.GetByteCount(Char* chars, Int32 count, EncoderNLS baseEncoder)
at System.Text.UTF8Encoding.GetByteCount(String chars)
at System.IO.BinaryWriter.Write(String value)
at Microsoft.Build.BackEnd.NodePacketTranslator.NodePacketWriteTranslator.TranslateDictionary(Dictionary`2& dictionary, IEqualityComparer`1 comparer)
at Microsoft.Build.Execution.BuildParameters.Microsoft.Build.BackEnd.INodePacketTranslatable.Translate(INodePacketTranslator translator)
at Microsoft.Build.BackEnd.NodePacketTranslator.NodePacketWriteTranslator.Translate[T](T& value, NodePacketValueFactory`1 factory)
at Microsoft.Build.BackEnd.NodeConfiguration.Translate(INodePacketTranslator translator)
at Microsoft.Build.BackEnd.NodeProviderOutOfProcBase.NodeContext.SendData(INodePacket packet)
...

OK, so it doesn't like a character. But a character in WHAT? Well, we'd assume a source file, but it's important to remember that there are other pieces of input to a compiler, like path names, environment variables, commands passed to the compiler as switches, etc.

It says Index 1321 which seems pretty far into a string before it gets mad. I asked a few people inside and Sara Joiner says:

It looks like the only place in BuildParameters that we call TranslateDictionary is when transferring the state of the environment [variables] across the wire. 

Ah, so this is splitting up name-value pairs that are the environment variables! David Kean says "ask him what his PATH looks like." I ask and I get almost 2000 bytes of PATH! It's a HUGE path, it looks like it may even have been duplicated and appended to itself a few times.

Here's just a bit of the PATH in question. See anything?

\;C:\PROGRA~1\DISKEE~1\DISKEE~1\;C:\Program Files (x86)\Windows Kits\8.0\Windows
Performance Toolkit\;C:\Program Files\Microsoft SQL
Server\110\Tools\Binn\;C:\Program Files\Microsoft\Web Platform
Installer\;C:\Program Files\TortoiseSVN\binVN\???p??;C:\Program
Files\TortoiseSVN\bin;C:\PHP\;C:\progra~1\NVIDIA
Corporation\PhysX\Common;C:\progra~2\Common Files\Microsoft Shared\Windows
Live;C:\progra~1\Common Files\Microsoft Shared\Windows
Live;C:\q\w32;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;
C:\Windows\System32\WindowsPowerShell\v1.0\;C:\progra~2\WIDCOMM\Bluetooth
Software\;C:\progra~2\WIDCOMM\Bluetooth

See those ??? marks? Those don't feel like real question marks to me. I open the result of "SET > env.txt" as a binary file in Visual Studio and the bytes are 3Fs, which are ? marks.

I think the text file was converted to ANSI

This makes me think that there's Unicode goo in the PATH that was converted to ANSI when it was piped. Phrased differently, this text file isn't reality.
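Here's a small sketch of why a redirected file can show plain question marks. It's Python, not the Windows console code, and it assumes code page 1252 stands in for "ANSI" (the customer's actual code page isn't known). The helper name `to_ansi` and the two sample characters are made up for illustration.

```python
def to_ansi(text: str, code_page: str = "cp1252") -> bytes:
    """Encode text to a single-byte code page, substituting '?' for
    anything the code page cannot represent (a lossy conversion)."""
    return text.encode(code_page, errors="replace")


# Two characters with no mapping in code page 1252, surrounded by ASCII.
original = "bin\\\u4f71\u1923p;"
converted = to_ansi(original)

print(converted)           # the unmappable characters are now byte 0x3F
print(converted.hex(" "))  # same bytes, shown as hex
```

Once the characters have been replaced, the file holds genuine 0x3F bytes and there's no way to recover what was there before.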

However, elsewhere in the Windows UI his PATH variable looks different.

C:\Program Files\TortoiseSVN\binVN\�侱ᤣp䥠؉;

Sometimes that corruption in the path looks like this and you might assume it's Chinese. No, it's corruption that's getting interpreted as Unicode. Interestingly the error said the naughty character was 0xD97C which is &#0xD97C; � which implies to me that something got stripped out at some point in processing and turned into the Unicode equivalent of 'uh...' Regardless, it's wrong and it needs to be removed.
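If you want to hunt for this kind of thing yourself, here's a rough sketch in Python. It is not what MSBuild does internally; it just tries a strict UTF-8 encode of every environment variable name and value, which is the same kind of conversion that blew up in the call stack. `find_unencodable` is a hypothetical helper name, and the reported index counts Python code points, so it may not match the index in the .NET error exactly.

```python
import os
from typing import Dict, List, Tuple


def find_unencodable(env: Dict[str, str]) -> List[Tuple[str, str, int, str]]:
    """Return (variable, part, index, code point) for the first character in
    each variable's name or value that strict UTF-8 refuses to encode."""
    problems = []
    for name, value in env.items():
        for part, text in (("name", name), ("value", value)):
            try:
                text.encode("utf-8")  # strict mode rejects lone surrogates
            except UnicodeEncodeError as err:
                code_point = f"U+{ord(text[err.start]):04X}"
                problems.append((name, part, err.start, code_point))
                break  # one report per variable is enough
    return problems


if __name__ == "__main__":
    for name, part, index, code_point in find_unencodable(dict(os.environ)):
        print(f"{name} ({part}): {code_point} at index {index}")
```

No output means every variable encodes cleanly; anything printed is a candidate for cleanup.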

I ask him if cleaning his PATH worked and the customer just sent me a one-line response via email...the best kind of response:

========== Build: 12 succeeded, 0 failed, 0 up-to-date, 0 skipped ==========

Yay! I hope this helps the next person who goes aGoogling for the answer and thought they were alone. Thanks to David Kean, Sara Joiner and Srinivas Nadimpalli for looking at the call stack and guessing at solutions with me!

Any insights, Dear Reader?



About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

June 08, 2013 5:09
Good detective work. It would have been nice if the exception showed the entire string rather than the offending character.
June 08, 2013 5:47
0xD97C is the first word of a UTF-16 surrogate pair, so perhaps the code barfing on this input doesn't correctly handle Unicode characters above U+FFFF. Another option is that the second word of the surrogate pair is missing and the code doesn't handle invalid UTF-16.
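To make the surrogate point concrete, here is a minimal Python sketch. The surrogate ranges are the standard UTF-16 ones; the function names are made up, and the low surrogate 0xDC00 is picked arbitrarily just to show a complete pair.

```python
def surrogate_kind(code_unit: int) -> str:
    """Classify a 16-bit UTF-16 code unit."""
    if 0xD800 <= code_unit <= 0xDBFF:
        return "high (leading) surrogate"
    if 0xDC00 <= code_unit <= 0xDFFF:
        return "low (trailing) surrogate"
    return "not a surrogate"


def combine(high: int, low: int) -> int:
    """Combine a valid high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


print(surrogate_kind(0xD97C))

# On its own, the code unit cannot be encoded as strict UTF-8.
try:
    "\ud97c".encode("utf-8")
except UnicodeEncodeError as err:
    print("encode failed:", err.reason)

# Followed by a low surrogate, it becomes an ordinary code point above U+FFFF.
print(hex(combine(0xD97C, 0xDC00)))
```

So the encoder in the call stack isn't necessarily broken for characters above U+FFFF; a high surrogate with no partner is simply not encodable.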
June 08, 2013 7:16
Mark! That must be it. Nicely done. I thought it might be the first of a pair but was googling UTF8, not UTF16, a brain fart on my part. Thanks!
June 08, 2013 10:11
Just a guess, but I suspect a program/installer that's attempted to edit PATH by messing with the registry, and it's miscalculated the buffer size, and managed to use an ANSI-sized buffer with a Unicode API or vice versa.
June 08, 2013 10:15
There are a few programs out there that rewrite %PATH% to include themselves and, in so doing, replace environment variables with hard-coded paths and muck up Unicode entries. It pays to make a habit of checking %PATH% after running an installer.
June 08, 2013 11:09
There are lots of installers that screw up %path%. The Delphi installer is a huge offender, since it actually *wipes out the path completely*.

Fortunately, that's what ControlSet### are for.
K
June 08, 2013 11:10
(Oh, and I second the call for all error messages to include EXACTLY WHAT PARAMETER CAUSED THE FAILURE IN THE FIRST PLACE. In order to determine the failure, you had to fail processing [something], why are we never shown what [something] was? In this case, it would've saved an email... and would've saved a bunch of other people what was likely days/weeks of frustration.)
K
June 08, 2013 15:17
"Any insights, Dear Reader?"

Yes. More QA.
June 08, 2013 19:13
I have actually come across this in the past, but until I read your full response I wouldn't have remembered. Well done...
Mel
June 10, 2013 9:26
Is it really acceptable that a weird character in your path will crash the build? "Garbage in, garbage out" is often another way of saying "I blame the user for my bugs".
June 10, 2013 10:41
@Henri

Yes. If the path is garbage the build should fail. Because there is no way for the system to know if it's creating the correct output when part of the path is unreadable.

Otoh the error message should state clearly that the path is the one to blame.
June 10, 2013 19:36
If only we could attach a debugger at build time!
June 10, 2013 20:45
A few (pointless :-) points:

1. My first suspect is TortoiseSVN.
The problem shows in between two TortoiseSVN entries, and the binVN seems to suggest the traces of yet another SVN before.

    TortoiseSVN\binVN\???p??;C:\Program Files\TortoiseSVN\bin
Probably an uninstall followed by a new install. The uninstall tried to remove the previous entry, then a new install added one.
Circumstantial evidence, I agree. But many applications ported from Linux (or Linux developers) tend to have problems with "char* is UTF-8" and "what the heck is this wchar_t abomination".

2. 0xD97C is indeed half a surrogate pair, but I would argue that nothing should fail.
The path is valid from the file system perspective (as NTFS is not surrogate-aware). I can really have that thing on disk, and things would run from there.
And for any application working with it using only wchar_t it would be no problem either. The only reason for this failure is that VS tried to convert it to UTF-8 (see UTF8Encoding).
The result is invalid? Yes and no. Application A should not fail because application B did something stupid. Normal path processing would be: split at ;, check each entry and see if it exists on disk, and if I can run something from there, ignore if not.

3. "character was 0xD97C which is &#0xD97C; �"
Nope, � is U+FFFD, used as the replacement character when one encounters an invalid Unicode sequence (like for instance a broken surrogate pair :-)
http://www.fileformat.info/info/unicode/char/fffd/index.htm
So the text pasted in this blog went through yet another application that was UTF-16 aware.
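The "normal path processing" from point 2 (split at ;, keep what exists, ignore the rest) might look roughly like this in Python. This is a sketch, not how MSBuild or Windows actually resolves PATH. `usable_path_entries` is a made-up helper name, and "can I run something from there" is approximated by "is it an existing directory".

```python
import os
from typing import List


def usable_path_entries(path_value: str, sep: str = ";") -> List[str]:
    """Split a PATH-style string and keep only entries that exist as directories."""
    usable = []
    for entry in path_value.split(sep):
        entry = entry.strip()
        if not entry:
            continue  # empty segment, e.g. from a doubled separator
        try:
            if os.path.isdir(entry):
                usable.append(entry)
        except (ValueError, UnicodeError):
            # The entry cannot even be handed to the OS: skip it, don't fail.
            continue
    return usable


if __name__ == "__main__":
    for entry in usable_path_entries(os.environ.get("PATH", ""), os.pathsep):
        print(entry)
```

A corrupted entry is dropped like any other directory that doesn't exist, and the rest of the PATH keeps working.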
June 10, 2013 23:44
Mihai - I found D97C here as ? http://everythingfonts.com/unicode/0xD97C
June 11, 2013 2:27
running MSBuild /debug in debug mode.

Environmental variable from locals: Environment Count = 55

In theory this could have happened in any of those places. PATH is a likely choice. But it could have been anywhere. Without the debugger it would be a needle in the haystack.
June 11, 2013 2:39
@Scott

Don't let the look deceive you :-)
Always go copy/paste or even better, bypass that (if you can), and hex dump.
If you do that with the character in your post, it is FFFD.
For instance

  • curl -o badCp.html "http://www.hanselman.com/blog/CSIVisualStudioUnableToTranslateUnicodeCharacterAtIndexXToSpecifiedCodePage.aspx"

  • hexdump -C badCp.html | less


Shows that the bytes after binVN\ are <EF BF BD>
That really is U+FFFD (see http://www.fileformat.info/info/unicode/char/fffd/index.htm)
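The same check can be done without a hex table. A small Python sketch: the first part decodes the three bytes from the hexdump, and the second part assumes a UTF-16 aware program that replaces errors instead of failing (which specific program touched the text is not known).

```python
# The three bytes reported by hexdump, decoded as UTF-8.
decoded = bytes.fromhex("EF BF BD").decode("utf-8")
print(f"U+{ord(decoded):04X}")

# A lone high surrogate (0xD97C) followed by the letter "p", as UTF-16LE bytes.
lone_high = (0xD97C).to_bytes(2, "little") + "p".encode("utf-16-le")

# A decoder that replaces errors swaps the broken surrogate for U+FFFD.
repaired = lone_high.decode("utf-16-le", errors="replace")
print([f"U+{ord(c):04X}" for c in repaired])
```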

http://everythingfonts.com tries to render things as Unicode text, but the text is not valid, and various browsers will react to that in various ways (it is very likely that we don't even see the same thing :-)
So the browser takes U+D97C and tries to render it, but that is an invalid stand-alone surrogate (they should come in high/low pairs), so it uses U+FFFD instead. Take a look at anything between http://everythingfonts.com/unicode/0xD800 and http://everythingfonts.com/unicode/0xDFFF (the surrogate range) and you will see the same thing, a "black diamond with a white question mark". Same as http://everythingfonts.com/unicode/0xFFFD.

I usually recommend http://www.fileformat.info, as it uses SVGs to render the characters.
Otherwise there are too many layers trying to "fix" things (browser, OS text engine, ...)
See in this case http://www.fileformat.info/info/unicode/char/d97c/index.htm (it also says "U+D97C is not a valid unicode character.")
And http://www.fileformat.info/info/unicode/char/fffd/index.htm (the "real black diamond with white question mark" :-)

Mihai
June 17, 2013 18:32
Great post to get my morning going.


Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.