Why the #AskObama Tweet was Garbled on Screen: Know your UTF-8, Unicode, ASCII and ANSI Decoding Mr. President
UPDATE: The contractor/vendor that made the software commented on Hacker News with more technical information. They're a very classy shop and have handled this REALLY minor gaffe very well, to their credit. I mean, let's put this into perspective, it's a fun nit, it's a weird thing that only we programmers understand, but ultimately what we can all agree on is Obama should outlaw Smart Quotes immediately.
The Speaker of the House of Representatives John Boehner tweeted this a few days ago. Note that this is not a political blog post.
After embarking on a record spending binge that’s left us deeper in debt, where are the jobs? #AskObama
During the #AskObama Live Twitter event, the Tweets then came up on a big Plasma screen. This tweet came up "garbled" and said:
After embarking on a record spending binge thatâ€™s left us deeper in debt, where are the jobs? #AskObama
And a million programmers, regardless of political party, groaned in unison. First, because someone screwed up their UTF-8 decoding, by not doing it, and second, because our President doesn't recognize a text encoding bug when he sees one! Well, maybe that second one was just me, but still. Tragic. The President then teased the Speaker for his typing while newspapers and news organizations struggled to get their minds around this "garbled tweet."
Well, Boehner could have tweeted "that's left us deeper..." but he tweeted "that’s." Note the "smart" apostrophe. He used Tweetdeck to tweet it, and it was likely on a Mac. It's also possible that he wrote the tweet in Microsoft Word and then copied and pasted it, as Word loves to change straight quotes and apostrophes like ' into smart quotes and smart apostrophes with direction, like this ’.
When I ask for this timeline, the HTTP Headers say it's encoded as "UTF-8", see?
Content-Type: application/json; charset=utf-8
I blogged about the "Importance of being UTF-8" about five years ago. If you look at the JSON and find the tweet with the ID 88618213008621568, you can see the raw text encoded in JSON:
"text":"After embarking on a record spending binge that\u2019s left us deeper in debt, where are the jobs?"
See that \u2019? In Windows (you have this program even if you aren't a developer) go to the Start Menu and run "Charmap." Look around and you can see U+2019 is Right Single Quotation Mark. Note that it's WAY down in the list of all the characters. It's not a basic character like A to Z or a to z. It's one of those special things that looks nice, but causes trouble later.
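No Charmap handy? A few lines of Python can do the same lookup. This is only a sketch: the JSON string is a trimmed-down stand-in for the real timeline response, and the names `raw` and `smart_chars` are mine, not anything from Twitter's API.

```python
import json
import unicodedata

# A trimmed-down stand-in for the tweet object in the timeline JSON.
raw = r'{"text":"After embarking on a record spending binge that\u2019s left us deeper in debt, where are the jobs?"}'

# json.loads turns the \u2019 escape back into the real character.
text = json.loads(raw)["text"]

# Collect every character that sits outside the basic ASCII range.
smart_chars = [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text if ord(ch) > 127]

print(smart_chars)  # [('U+2019', 'RIGHT SINGLE QUOTATION MARK')]
```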
If I make a text file in Notepad that looks like this and name it text.txt, for example, and Save As, making sure to use UTF-8 as the encoding...
After embarking on a record spending binge that’s left us deeper in debt, where are the jobs?
...then load it into any free HEX editor (or even an online one!), I can see the raw bytes.
Note that the part where the ’ was is actually three full bytes! E2 80 99.
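If you'd rather not hunt down a hex editor, Python can stand in for one. A minimal sketch, using just the word "that’s" rather than the whole file; the variable names are made up for illustration.

```python
# The smart apostrophe, written as an escape so this file's own encoding can't bite us.
apostrophe = "\u2019"
word = "that" + apostrophe + "s"

# Encode to UTF-8 and print each byte as two hex digits, hex-editor style.
encoded = word.encode("utf-8")
hex_dump = " ".join(f"{byte:02X}" for byte in encoded)

print(hex_dump)                 # 74 68 61 74 E2 80 99 73
print(len(word), len(encoded))  # 6 8
```

Six characters, eight bytes. The two extra bytes are all down to that one apostrophe.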
Well, UTF-8 is an encoding whose goal was to not only support a bajillion different characters but also to be backwards compatible with ASCII, the American Standard Code for Information Interchange. If it wasn't, we wouldn't be able to see MOST of the characters in this tweet! In this case, just the ’ is goofy.
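You can check the backwards compatibility claim yourself. Here's a small sketch (the sample string is just the tail end of the tweet) showing that plain ASCII text comes out byte-for-byte identical in both encodings, while the smart apostrophe doesn't fit in ASCII at all.

```python
ascii_part = "where are the jobs? #AskObama"

# Plain ASCII text produces identical bytes under both encodings.
same_bytes = ascii_part.encode("ascii") == ascii_part.encode("utf-8")
print(same_bytes)  # True

# The smart apostrophe has no ASCII representation at all.
try:
    "\u2019".encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)  # False
```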
The code point was U+2019, which is 0010 0000 0001 1001, says Windows Calculator in Programmer Mode. You have this too, Dear Reader. There's some variable width encoding going on, that you can read about on Wikipedia.
This value of U+2019 expands to: 0010 0000 0001 1001, as I said, which then expands according to these rules
zzzzyyyy yyxxxxxx -> 1110zzzz 10yyyyyy 10xxxxxx
Which gives us this
11100010 -> E2
10000000 -> 80
10011001 -> 99
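The same bit shuffling can be sketched by hand in Python. To be clear, `encode_three_byte` is a hypothetical helper I wrote for illustration, and it only handles the three-byte case (U+0800 through U+FFFF); in real code you'd just call `str.encode("utf-8")`.

```python
def encode_three_byte(code_point: int) -> bytes:
    """Hand-rolled UTF-8 encoding for the three-byte range only."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only covers U+0800 through U+FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not valid in UTF-8")

    z = code_point >> 12           # top 4 bits
    y = (code_point >> 6) & 0x3F   # middle 6 bits
    x = code_point & 0x3F          # low 6 bits

    # Marker prefixes: 1110 on the lead byte, 10 on each continuation byte.
    return bytes([0xE0 | z, 0x80 | y, 0x80 | x])


manual = encode_three_byte(0x2019)
print(" ".join(f"{byte:08b}" for byte in manual))  # 11100010 10000000 10011001
print(manual == "\u2019".encode("utf-8"))          # True
```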
hence, "that’s" is encoded as
74 68 61 74 E2 80 99 73
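The E2 80 99 in the middle is the ’. Read those same bytes back in, this time as Extended ASCII (the ANSI Windows 1252 Code page), and each byte becomes its own character: E2 is â, 80 is € and 99 is ™. That's the â€™ everyone saw on the big screen.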
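You can reproduce the whole gaffe in a couple of lines. A minimal sketch, assuming Python's built-in `cp1252` codec as the stand-in for "ANSI Windows 1252"; the variable names are mine.

```python
# Encode correctly as UTF-8...
utf8_bytes = "that\u2019s".encode("utf-8")

# ...then decode with the wrong code page, one character per byte.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # thatâ€™s

# The damage is reversible as long as nothing was lost along the way.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == "that\u2019s")  # True
```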
Made it this far? Why didn't I just say "The software read in a UTF-8 encoded JSON stream of tweets and displayed it with an ANSI Windows Code Page 1252." Because that wouldn't be nearly as fun.
Either way, the company that did this for the White House definitely goofed up and should have tested this. This is SUCH a classic sloppy programmer mistake that I'm disappointed to see it showcased so blatantly. I hope they (the vendor) feel a little bad. The company appears to be called "Mass Relevance" and here are some news articles about Mass Relevance and their "Tweet Curation."
Testing, testing, testing, my friends. And not only testing, but KNOW this stuff. They don't always teach it in schools and no one will learn until they see their bug on national TV in front of the President of the United States. ;)
UPDATE: The vendor said this in the comments. Very well said.
"It was definitely a mistake on our part. The problem was not the encoding on our data feed, but the HTML document was sent with ISO-8859-1. The second we inserted the twitter text into the DOM, the browsers interpreted the UTF-8 string as ISO-8859-1. Our visualizations are hosted on other platforms, and in this case the server was not configured to send UTF-8 with text/html even though the HTML file was encoded as such. It was the only issue (albeit a pretty obvious one) during an otherwise flawless event. I apologize to President Obama, Speaker Boehner, and Jack Dorsey for the mistake. If the readers of the blog think it was stupid, imagine how we felt. dev environment != production environment. If we would have just included a <meta charset="utf-8"> in the HTML head, then this would not have occurred.
The big take away is don’t make assumptions about other platforms (especially when it comes to encoding), and always include charset meta tag."
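To see why the declared charset matters so much, here's a rough sketch of a client picking a decoder from the Content-Type header. Everything in it is an assumption for illustration: `decode_body` is a hypothetical helper, not the vendor's code, and the fallback to Windows-1252 when no charset is given is my guess at a typical default. Treating ISO-8859-1 as Windows-1252 does mirror what browsers do.

```python
from email.message import Message


def decode_body(body: bytes, content_type: str, fallback: str = "cp1252") -> str:
    """Decode bytes using the charset named in a Content-Type header value."""
    header = Message()
    header["Content-Type"] = content_type
    charset = header.get_content_charset(failobj=fallback)

    # Browsers treat the ISO-8859-1 label as Windows-1252, so mimic that here.
    if charset in ("iso-8859-1", "latin-1"):
        charset = "cp1252"
    return body.decode(charset)


tweet_bytes = "that\u2019s".encode("utf-8")

good = decode_body(tweet_bytes, "text/html; charset=utf-8")
bad = decode_body(tweet_bytes, "text/html; charset=ISO-8859-1")

print(good)  # that’s
print(bad)   # thatâ€™s
```

Same bytes, two different labels, two different results. The bytes on the wire were never the problem.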
Text encoding is fun for all ages. Enjoy!
* Like this post? Put me on TV, folks. This is the kind of stuff that a real technology journalist *Pogue* would love to share with the people! ABC News? I'm available and I have Skype. Call my people. ;)