Scott Hanselman

Why the #AskObama Tweet was Garbled on Screen: Know your UTF-8, Unicode, ASCII and ANSI Decoding Mr. President

July 7, '11 Comments [54] Posted in Internationalization | Musings

The Washington Post and the Garbled Tweet

UPDATE: The contractor/vendor that made the software commented on Hacker News with more technical information. They're a very classy shop and have handled this REALLY minor gaffe very well, to their credit. I mean, let's put this into perspective, it's a fun nit, it's a weird thing that only we programmers understand, but ultimately what we can all agree on is Obama should outlaw Smart Quotes immediately.

The Speaker of the House of Representatives John Boehner tweeted this a few days ago. Note that this is not a political blog post.

After embarking on a record spending binge that’s left us deeper in debt, where are the jobs? #AskObama

During the #AskObama Live Twitter event, the Tweets then came up on a big Plasma screen. This tweet came up "garbled" and said:

After embarking on a record spending binge thatâ€™s left us deeper in debt, where are the jobs? #AskObama

And a million programmers, regardless of political party, groaned in unison. First, because someone screwed up their UTF-8 decoding, by not doing it, and second, because our President doesn't recognize a text encoding bug when he sees one! Well, maybe that second one was just me, but still. Tragic. The President then teased the Speaker for his typing while newspapers and news organizations struggled to get their minds around this "garbled tweet."

Well, Boehner could have tweeted "that's left us deeper..." but he tweeted "that’s." Note the "smart" apostrophe. He used Tweetdeck to tweet it, and it was likely on a Mac. It's also possible that he wrote the tweet in Microsoft Word and then copied and pasted it, as Word loves to change straight quotes and apostrophes like ' into directional smart quotes and smart apostrophes like ’.

I can get John Boehner's User ID (not his twitter name, but the number that represents John) with this online tool http://www.idfromuser.com. I see that it's 5357812, so I can get his timeline as RSS (Really Simple Syndication)/XML like this: http://twitter.com/statuses/user_timeline/5357812.rss or JSON (JavaScript Object Notation) like this http://twitter.com/statuses/user_timeline/5357812.json 

When I ask for this timeline, the HTTP Headers say it's encoded as "UTF-8", see?

Content-Type: application/json; charset=utf-8

I blogged about the "Importance of being UTF-8"  about five years ago. If you look at the JSON and find the tweet with the ID 88618213008621568, you can see the raw text encoded in JSON:

"text":"After embarking on a record spending binge that\u2019s left us deeper in debt, where are the jobs?"

See that \u2019? In Windows (you have this program even if you aren't a developer) go to the Start Menu and run "Charmap." Look around and you can see U+2019 is Right Single Quotation Mark. Note that it's WAY down in the list of all the characters. It's not a basic character like A to Z or a to z. It's one of those special things that looks nice, but causes trouble later.
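If you don't have Charmap handy, the same lookup can be sketched in a few lines of Python. This is only an illustration: the `raw` string below is the JSON fragment quoted above wrapped in braces so it parses on its own, and the variable names are made up for the example.

```python
import json
import unicodedata

# The JSON fragment quoted above, kept as raw text so the \u2019 escape
# is still an escape and not yet a character.
raw = r'{"text":"After embarking on a record spending binge that\u2019s left us deeper in debt, where are the jobs?"}'

# The JSON parser turns the \u2019 escape into the actual character.
tweet = json.loads(raw)["text"]

# List every character that falls outside the 7-bit ASCII range.
non_ascii = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in tweet if ord(ch) > 127]
print(non_ascii)  # [('U+2019', 'RIGHT SINGLE QUOTATION MARK')]
```

Everything else in the tweet is plain ASCII, which is why only one character ends up misbehaving.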

Character Map

If I make a text file in Notepad that looks like this and name it text.txt, for example, and Save As, making sure to use UTF-8 as the encoding...

After embarking on a record spending binge that’s left us deeper in debt, where are the jobs?

...then load it into any free HEX editor (or even an online one!) I see this:

The Tweet in a Hex Editor - E2, 80, and 99 are highlighted

Note that the part where the ’ was is actually three full bytes! E2 80 99.
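No hex editor? Here is a rough Python equivalent of the Notepad experiment, assuming the same tweet text as the file contents. It encodes the string as UTF-8 and picks out the bytes that are not plain ASCII.

```python
# Same text as the Notepad file, with the smart apostrophe written as an escape.
tweet = "After embarking on a record spending binge that\u2019s left us deeper in debt, where are the jobs?"

# This is what "Save As UTF-8" does to the text.
data = tweet.encode("utf-8")

# ASCII bytes are all below 0x80; anything at or above 0x80 is part of a multi-byte sequence.
high_bytes = [f"{b:02X}" for b in data if b >= 0x80]
print(high_bytes)               # ['E2', '80', '99']

# One character became three bytes, so the file is two bytes longer than the text.
print(len(data) - len(tweet))   # 2
```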

Well, UTF-8 is an encoding whose goal was to not only support a bajillion different characters but also to be backwards compatible with ASCII, the American Standard Code for Information Interchange. If it wasn't, we wouldn't be able to see MOST of the characters in this tweet! In this case, just the ’ is goofy.

The code point was U+2019, which is 0010 0000 0001 1001, says Windows Calculator in Programmer Mode. You have this too, Dear Reader. There's some variable width encoding going on, that you can read about on Wikipedia.

This value of U+2019 expands to: 0010 0000 0001 1001, as I said, which then expands according to these rules:

zzzzyyyy yyxxxxxx ->
1110zzzz
10yyyyyy
10xxxxxx

Which gives us this

11100010 -> E2
10000000 -> 80
10011001 -> 99
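The bit shuffling above can be written out by hand. This is a minimal sketch, not how any real library does it: `utf8_three_byte` is a made-up helper that only handles the three-byte pattern shown here, and it ignores details like the surrogate range.

```python
def utf8_three_byte(code_point: int) -> bytes:
    """Apply the zzzzyyyy yyxxxxxx -> 1110zzzz 10yyyyyy 10xxxxxx rule."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("the three-byte pattern only covers U+0800 to U+FFFF")
    zzzz = code_point >> 12              # top 4 bits
    yyyyyy = (code_point >> 6) & 0x3F    # middle 6 bits
    xxxxxx = code_point & 0x3F           # low 6 bits
    return bytes([0b11100000 | zzzz, 0b10000000 | yyyyyy, 0b10000000 | xxxxxx])


encoded = utf8_three_byte(0x2019)
print(encoded.hex(" ").upper())               # E2 80 99

# Sanity check against Python's built-in encoder.
print(encoded == "\u2019".encode("utf-8"))    # True
```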

hence, "that’s" is encoded as

74 68 61 74 E2 80 99 73

The E2 80 99 in the middle is the ’. Read those bytes back in, this time as Extended ASCII (the ANSI Windows 1252 code page), and the ’ expands into three characters:

thatâ€™s

Made it this far? Why didn't I just say "The software read in a UTF-8 encoded JSON stream of tweets and displayed it with an ANSI Windows Code Page 1252." Because that wouldn't be nearly as fun.
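That one-sentence version fits in a few lines of Python. A minimal sketch, assuming Python's `cp1252` codec as a stand-in for whatever the display software actually used:

```python
original = "that\u2019s"

# What Twitter sent: the text encoded as UTF-8.
wire_bytes = original.encode("utf-8")

# The bug: reading those same bytes back as Windows-1252.
# E2 -> â, 80 -> €, 99 -> ™
garbled = wire_bytes.decode("cp1252")
print(garbled)      # thatâ€™s

# Undoing the damage: go back to the bytes, then decode them correctly.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)   # True
```

Nothing was lost in transit. The bytes were right the whole time and only the decoding step was wrong, which is why the round trip back is possible.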

Either way, the company that did this for the White House definitely goofed up and should have tested this. This is SUCH a classic sloppy programmer mistake that I'm disappointed to see it showcased so blatantly. I hope they (the vendor) feel a little bad. The company appears to be called "Mass Relevance" and here are some news articles about Mass Relevance and their "Tweet Curation."

Testing, testing, testing,  my friends. And not only testing, but KNOW this stuff. They don't always teach it in schools and no one will learn until they see their bug on national TV in front of the President of the United States. ;)

UPDATE: The vendor said this in the comments. Very well said.

"It was definitely a mistake on our part. The problem was not the encoding on our data feed, but the HTML document was sent with ISO-8859-1. The second we inserted the twitter text into the DOM, the browsers interpreted the UTF-8 string as ISO-8859-1. Our visualizations are hosted on other platforms, and in this case the server was not configured to send UTF-8 with text/html even though the HTML file was encoded as such. It was the only issue (albeit a pretty obvious one) during an otherwise flawless event. I apologize to President Obama, Speaker Boehner, and Jack Dorsey for the mistake. If the readers of the blog think it was stupid, imagine how we felt. dev environment != production environment. If we would have just included a <meta charset="utf-8"> in the HTML head, then this would not have occurred.

The big take away is don’t make assumptions about other platforms (especially when it comes to encoding), and always include charset meta tag."

Mass Relevance
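Here is a rough sketch of the belt-and-suspenders version of that advice, using Python's standard `http.server`. This is not the vendor's setup, which I haven't seen. The page, the `TweetWallHandler` class, the `serve` function, and port 8000 are all made up for illustration. The point is only that the charset is declared twice: once in the HTTP header and once in the HTML itself.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Declared in the HTTP response header...
CONTENT_TYPE = "text/html; charset=utf-8"

# ...and again inside the document, in case some other server drops the header.
PAGE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Tweet wall (hypothetical)</title>
</head>
<body><p>that\u2019s left us deeper in debt</p></body>
</html>
"""


class TweetWallHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Encode the page with the same encoding the header and meta tag promise.
        body = PAGE.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", CONTENT_TYPE)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Serve the page on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), TweetWallHandler).serve_forever()
```

Call `serve()` and open http://127.0.0.1:8000/ to see the apostrophe come through intact. Browsers treat the ISO-8859-1 label as Windows-1252, which is why the on-screen garbage matched the code page 1252 walkthrough above.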

Text encoding is fun for all ages. Enjoy!

* Like this post? Put me on TV, folks. This is the kind of stuff that a real technology journalist *Pogue* would love to share with the people! ABC News? I'm available and I have Skype. Call my people. ;)

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. I am a failed stand-up comic, a cornrower, and a book author.

Thursday, July 07, 2011 8:48:48 PM UTC
need to s/2010/2019/ AFAICT?

you can see U+2010 is Right Single Quotation Mark.
Thursday, July 07, 2011 8:51:35 PM UTC
Nice detective work! Oh and a good quick tutorial on the "Importance of UTF-8"
Thursday, July 07, 2011 8:52:41 PM UTC
No. Check the status bar of CharMap in the screenshot. U+2010 is an uncommonly used hyphen character.
Thursday, July 07, 2011 9:02:46 PM UTC
Very informative post. I love it. Thanks a lot!
FYI - The link to "Importance of being UTF-8" is dead.
non
Thursday, July 07, 2011 9:08:12 PM UTC
I applaud you sir.
Thursday, July 07, 2011 9:09:36 PM UTC
Scott, James is referring to your typo in your article text where you have
Look around and you can see U+2010 ...
Thursday, July 07, 2011 9:10:08 PM UTC
I find it amusing that the "Tweet" button produces a 146 character twitter status. Know your title length Mr Hanselman!
Thursday, July 07, 2011 9:36:37 PM UTC
I had a related tweet just the other day about this:
https://twitter.com/#!/travis/status/86851708755513344
Thursday, July 07, 2011 9:36:54 PM UTC
Philip: Thanks - I should have done a blockquote to make it more obvious that I was referring to text from the post. :)
Thursday, July 07, 2011 9:41:45 PM UTC
It was definitely a mistake on our part. The problem was not the encoding on our data feed, but the HTML document was sent with ISO-8859-1. The second we inserted the twitter text into the DOM, the browsers interpreted the UTF-8 string as ISO-8859-1. Our visualizations are hosted on other platforms, and in this case the server was not configured to send UTF-8 with text/html even though the HTML file was encoded as such. It was the only issue (albeit a pretty obvious one) during an otherwise flawless event. I apologize to President Obama, Speaker Boehner, and Jack Dorsey for the mistake. If the readers of the blog think it was stupid, imagine how we felt. dev environment != production environment. If we would have just included a <meta charset="utf-8"> in the HTML head, then this would not have occurred.

The big take away is don’t make assumptions about other platforms (especially when it comes to encoding), and always include charset meta tag.
Thursday, July 07, 2011 10:00:56 PM UTC
Oh, and you seem to have a little encoding escaping problem going on yourself ;)
I blogged about the "&lt;a href="http://The%20Importance%20of%20being%20UTF-8"&gt;Importance of being UTF-8&lt;/a&gt;"&nbsp; about five years ago.


Except for the URL which is incorrect (should've been "/blog/TheImportanceOfBeingUTF8.aspx") the quotes (i.e. ") should've been &quot;'s. Also the live preview has trouble with < and &lt; so I'm crossing my fingers when posting this comment...
Thursday, July 07, 2011 10:01:33 PM UTC
Nice post, Scott!

I guess that US developers are a bit less used to handling strings with characters outside of the "common area" between charsets and encodings (0-9, A-Z, a-z, punctuation and such), as there are no diacritical marks (accents and similar) in your everyday language. The problem you describe here only surfaced due to the use of "nice typography" by some word processor (or OS).

In Portugal (where I live), almost any sentence in Portuguese has at least one accented character, and so it's much easier to face a test case where this exact problem arises. However, I should note that not many people on my team are able to tell what's going on at the byte level, and why sometimes these strange characters come up in some given tool output.

I'm about to make a presentation to my team at work about encodings... I really hope I can help to clear up some doubts, and I believe that this will be a perfect example of why the proper handling of character encodings can be so important in a project.
Pedro Sousa
Thursday, July 07, 2011 10:01:38 PM UTC
I blogged about the "Importance of being UTF-8" -----This LINK is dead
rugved
Thursday, July 07, 2011 10:04:19 PM UTC
Crap; it messed up. The &lt; and &gt; in the blockquote should've read < and >. I can't find a way to post a (readable) url without it messing up or creating an actual link:

Hanselman's Blog
&lt;a href="http://hanselman.com"&gt;Hanselman's Blog&lt;/a&gt;

Only way is to put spaces... That's soooooo 1993 :-P
< a href="http://hanselman.com">Hanselman's Blog< /a>
Thursday, July 07, 2011 10:31:16 PM UTC
Uh, John Boehner is Speaker of the House of Representatives, not a Senator.

Politics and Tech -- how little each side understands the other....
Thursday, July 07, 2011 11:06:13 PM UTC
Including <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/> is like the first thing you do when starting front-end development. Every time you don't validate...God kills a kitten. Please, think of the kittens.

This has been a Public Service Announcement. Please code responsibly: http://validator.w3.org/
Thursday, July 07, 2011 11:06:51 PM UTC
Nice blog post. As Pedro Sousa said, in the English-speaking world you don't often see such non-ASCII characters.
I'm a German developer and I have banged my head millions of times because of the wrong character sets in the database, the text, and the browser.
The ä, ö, ü, and the ß are really painful in German texts and even in my name! Having only one character set in the world would be best. Praise to Unicode.
Thursday, July 07, 2011 11:30:56 PM UTC
Why send curly quotes in the first place, though? Everyone should know by now that it'll cause a problem somewhere.

More importantly: Was that what was actually typed, or did Office or the Mac or whatever "help" by changing from something small and compatible to something large and incompatible? "Smart quotes" just aren't.
Thursday, July 07, 2011 11:37:47 PM UTC
@TVD: XHTML? Really? ;-)

Great write-up, thanks!
I was going to refer to Joel's article, but RobIII (hi rob!) beat me to it.
Rob
Thursday, July 07, 2011 11:50:13 PM UTC
What's the tradeoff for sending Content-Type by HTTP over <meta http-equiv>? I'd prefer forcing UTF-8 on for all pages and fix the broken ones (either by fixing the content or adding <meta>) rather than trusting every page to get it right.
Friday, July 08, 2011 12:03:34 AM UTC
One of the easiest explanations of UTF-8, I've encountered. Well done!
Friday, July 08, 2011 12:13:38 AM UTC
☑ Excellent explanation
Friday, July 08, 2011 12:47:12 AM UTC
Handy reference: http://vazor.com/unicode/c2019.html
Friday, July 08, 2011 12:53:07 AM UTC
Why send curly quotes in the first place, though? Everyone should know by now that it'll cause a problem somewhere.


Because as developers, it is our job to make tools that work, no matter what. Real world text is not limited to ASCII either.
hoopz
Friday, July 08, 2011 2:08:16 AM UTC
@Simon: Joel Spolsky covers HTTP headers vs. HTML meta Http-Equiv in the blog post that @RobIII linked (a great read, as usual for Joel).
Friday, July 08, 2011 2:20:00 AM UTC
@David: Huh, thanks for the tip! I must have forgotten that - It's been a while since I read it :)
Friday, July 08, 2011 2:51:44 AM UTC
The company responsible seems to have responded in the comments:

"It was definitely a mistake on our part. The problem was not the encoding on our data feed, but the HTML document was sent with ISO-8859-1. The second we inserted the twitter text into the DOM, the browsers interpreted the UTF-8 string as ISO-8859-1. Our visualizations are hosted on other platforms, and in this case the server was not configured to send UTF-8 with text/html even though the HTML file was encoded as such. It was the only issue (albeit a pretty obvious one) during an otherwise flawless event. I apologize to President Obama, Speaker Boehner, and Jack Dorsey for the mistake. If the readers of the blog think it was stupid, imagine how we felt. dev environment != production environment. If we would have just included a <meta charset="utf-8"> in the HTML head, then this would not have occurred.
The big take away is don’t make assumptions about other platforms (especially when it comes to encoding), and always include charset meta tag." [emphasis mine]
ved
Friday, July 08, 2011 3:11:44 AM UTC
"Smart" apostrophes and quotation marks are the very face of Satan in the material world. Whatever numbnut thought them up should be dragged in manacles to the Hague to answer for his crimes.

And I say that as a writer with no dev background or responsibilities.
Jason Toon
Friday, July 08, 2011 5:08:34 AM UTC
The real problem here is that the politician who supposedly made the tweet had to outsource the event (a friggin' tweet) to a private contractor.
Friday, July 08, 2011 6:12:41 AM UTC
hoopz - That wasn't the question.
I wasn't asking why anyone should bother fixing character encoding (they should), I wanted to know whether the intent of the author was presumptuously altered in a destructive way.
Friday, July 08, 2011 7:37:40 AM UTC
Nice post Scott. Encoding problems still come up these days... And the sad part is, it's not that hard to do it right.

Also check out this blog post about encoding :The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Sander
Friday, July 08, 2011 7:38:13 AM UTC
Fixed a few of the errors pointed out. Thanks folks!
Friday, July 08, 2011 10:22:44 AM UTC
See it every day :) http://twitpic.com/5mxdsp
Friday, July 08, 2011 11:49:16 AM UTC
Hey Scott, thanks for your valuable writing. Appreciate your work.
Friday, July 08, 2011 12:12:23 PM UTC
@TVD: did you mean <meta charset="utf8" /> ?

XHTML is so 1999.
Friday, July 08, 2011 12:45:39 PM UTC
In other words:
Console.WriteLine(Encoding.Default.GetString(Encoding.UTF8.GetBytes("’")));

:-)
Friday, July 08, 2011 2:53:25 PM UTC
I think brianary and a couple others really have the right idea.

This is just one among thousands (at least) of problems caused by MS Word (and likely other apps, but that's the big offender) doing this character replacement in the first place.

The funny thing is, it's already an option to toggle this behavior, it's just that the default is to do the replacement.

http://office.microsoft.com/en-us/word-help/change-curly-quotes-to-straight-quotes-and-vice-versa-HA010173242.aspx

As we've learned over the years, DEFAULTS MATTER. If we were to ask anyone involved in this tweet whether they ever intended a 'smart quote' in the sentence instead of a normal apostrophe (you know, that character on the keyboard they actually hit), I'm 100% sure none of them would have said "yes, it was important to have a 'smart quote' included!". More importantly, I'm equally confident that if the tweet creator(s) understood character sets *and* that an app was or had made the change, they would have said 'um, no thanks, keep the regular apostrophe that works in most every character set just fine'

This is the kind of decision where, even if you think smart quotes look better, the marginal cost to the users in the resultant pain has FAR outweighed any marginal style benefit.

Yes, it'd be great to live in a world where everything handled utf-8 just great, and certainly any string-handling apps should throw in some testing with 'odd' characters (esp. these smart quotes, ugh), but IMHO it's important to identify the root cause here as the apps that make such changes without the user asking for it, and the vast majority of the time, not even noticing the replacement happened.

Sure, fixing vNext of Word et al won't really fix this problem, but I think it's just as important a lesson to take away (liberal on input, conservative on output, where smart quotes violate the latter :) as 'you should test with utf-8'.

</rant> :)
Friday, July 08, 2011 4:55:19 PM UTC
The geek in me enjoyed the explanation and smiled...thanks!
Punit
Friday, July 08, 2011 5:11:23 PM UTC

"Smart" apostrophes and quotation marks are the very face of Satan in the material world. Whatever numbnut thought them up should be dragged in manacles to the Hague to answer for his crimes.

And I say that as a writer with no dev background or responsibilities.


*Sigh*

I really wish developers and writers (who should know better) would stop calling them "smart" or "curly" apostrophes/quotation marks. They're simply apostrophes and quotation marks. Real ones. Actual ones. The characters ' and " are not actually apostrophes and quotation marks. They're the result of engineers who made the typewriter trying to save space by combining real apostrophes/quotes with primes. They come from a technical limitation in the technology at the time, not from typography (just like the double space after a period).

Primes are used for units of measure. For example 6′ 2″ (shown with real primes).
Friday, July 08, 2011 5:58:28 PM UTC
Two things I just encountered with this:

1) Yesterday I took a code sample from the web and pasted it into SSRS's function evaluator and I was confused as to why the text box was red-underlining a string literal. Turns out the code sample was using "smart" left-right quotes.

2) I just saw my list of pod-casts from "This American Life", and the title on the web:

"Father's Day 2011"

Appears as this on iTunes's track listing:

"#438: Father&#039;s Day 2011"

Although now that I think of it - I think that is just a form of HTML encoding that went out as plain text...

Text is hard.
Sunday, July 10, 2011 12:58:59 PM UTC
Typographical correctness gone mad.

I wonder whether John Boehner went out of his way to insert the "proper" apostrophe. Is he someone who has memorized the keyboard shortcuts? What a thought! (Well, why not? We're not widdling about with typewriters anymore. This is not some bloaty feature of Microsoft Word - when type was set by hand with metal blocks and things, we had distinct opening and closing quotation marks, and it's a bit more pleasant to read. In a lot of languages, they use « Guillemets ».)
Monday, July 11, 2011 5:24:10 PM UTC
Had this exact problem today. Extremely timely and was done fixing my issue in 5-10m. Thanks!
Monday, July 11, 2011 7:36:07 PM UTC
Great reminder for folks. This sort of stuff comes up again and again. The more teams I work with the more I realize my career is going to be a continuous cycle of explaining items like this to new developers. A lot of developers I meet have got some of the advanced stuff down but when it comes to basics like encoding or exception handling they really struggle.
Monday, July 11, 2011 10:59:27 PM UTC
Justin :

Note that the Unicode database labels character 0x0027 as "APOSTROPHE". Also, we call the curly quotes Smart Quotes, because that's what Word (the manufacturer of 90% of the world's curly quotes) calls them.

Sure, x2032/&prime; and x2033/&Prime; as you have typed are "PRIME" and "DOUBLE PRIME", but doesn't that undermine the point you were making, that x0027 is the prime character?

Yes, it's a combined-purpose character, and it'd be more attractive to many if we could use precise typography, but it isn't yet practical!
Just think of the keyboard that supported all EIGHT types of dash/hyphen! Would that even work on a phone?

In a general-purpose environment, I'd be happier to have an ‽ interrobang, ؟ irony mark, ⚠ warning, ☠ skull & xbones, ☡ caution, ☢ radioactive, ☣ biohazard, ☤ caduceus, and other practical symbols supported before I started putting too much energy into pure æsthetics. Can we have both? Even better.


Eight dashes/hyphens: ‧ hyphenation point (break letters/syllables) ‐ hyphen ⁃ hyphen bullet − minus ‒ figure dash (number separator) – en dash (range, "to") — em dash (parenthetical, break) ― quotation dash (horizontal bar)
Thursday, July 14, 2011 7:39:55 AM UTC
Read this post yesterday and got this in my inbox today (from sitepoint)


"Hello Ivar ├â╞È├óΓé¼┬ªsell,"


Funny coincidence

/ Ivar Åsell
Thursday, July 14, 2011 2:50:35 PM UTC
To everyone complaining about word entering the smart quote automatically, you are missing the forest for the trees.

The responsibility lies with the vendor that created this system. They had absolutely all the information they needed to display it correctly and didn't. Would you have complained that everyone needs to learn to write English if a tweet in a different language had come through garbled?

Furthermore, MS Word is a tool for writing documents where an actual apostrophe/quote/etc. is likely to matter. It was not created for editing tweet text where it probably doesn't. It was a smart default for their tool and audience.
Matt Cauthon
Thursday, July 14, 2011 3:17:12 PM UTC
Matt, no one is missing anything. Obviously, as I already said, it should have displayed correctly (if the correct encoding was identified).

The point I was making was that there is another problem: whether a tool presumptuously made a destructive change that the author did not request.

That Word default made sense in a less-connected, Word-centric world, but has caused nothing but trouble as systems are still learning to cope with character encoding, particularly when content is mixed together from multiple sources at different layers. Word certainly doesn't market itself as a narrowly-targeted tool—its very name suggests that it wants to be your gateway to all text entry.

Word isn't a silo. It has to play in a world still grappling with encoding.
Thursday, July 14, 2011 4:31:53 PM UTC
is this the same thing, but in Razor Views?
<a href="http://larud.net/Blog/archive/2011/07/11/razor-view-engine-amp-unicode.aspx" title="Razor View Engine & Unicode?">Razor View Engine & Unicode?</a>
Friday, July 15, 2011 10:56:01 AM UTC
http://www.indiatimes.com/picture-stories/mumbai-no-choice-but-to-move-on/Mumbais-diamond-district/photostory/9231403.cms

Chrome browser fails to render the same text in tab.

Thanks
Anuj Pandey
Monday, July 18, 2011 8:09:29 AM UTC
That was awesome!
Almost felt like reading a fairy-tale.
Thanks for the wonderful and in-depth narration!
Cheers!
Wednesday, July 27, 2011 5:52:43 PM UTC
Apparently this is known as "Mojibake".
Friday, November 18, 2011 7:04:36 AM UTC
@brianary:

"Word isn't a silo."

Pretty much every program aimed towards word processing or publishing uses curly quotes. So, no, Word isn't a silo - it behaves just as it's expected, given its culture.

This isn't just about curly quotes, either - that's just a distraction, really. Developers that fail to render curly quotes will also fail to render accented characters. They'll fail to render mathematical symbols (the proper ones, not x/*, +, -, and /.) They'll fail to render Greek, Japanese, Korean, and so on. Should Japanese people type in Romaji to deal with "a world grappling with encoding"? Should Word convert all Japanese text to Romaji when it's copied because some websites aren't written with Japanese people in mind?

If it's purely aesthetic, why should it matter if people use it? Modern software and websites aren't in an ASCII-only silo either.
Tuesday, March 20, 2012 10:08:11 PM UTC
"If we would have just included a <meta charset="utf-8"> in the HTML head, then this would not have occurred." - c'mon, this can't be missed. When I was a junior developer I was likely to make such mistakes: this says a lot about their developers and testers.

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.