Scott Hanselman

Why the #AskObama Tweet was Garbled on Screen: Know your UTF-8, Unicode, ASCII and ANSI Decoding Mr. President

July 7, '11 Comments [54] Posted in Internationalization | Musings

The Washington Post and the Garbled Tweet

UPDATE: The contractor/vendor that made the software commented on Hacker News with more technical information. They're a very classy shop and have handled this REALLY minor gaffe very well, to their credit. I mean, let's put this into perspective, it's a fun nit, it's a weird thing that only we programmers understand, but ultimately what we can all agree on is Obama should outlaw Smart Quotes immediately.

The Speaker of the House of Representatives John Boehner tweeted this a few days ago. Note that this is not a political blog post.

After embarking on a record spending binge that’s left us deeper in debt, where are the jobs? #AskObama

During the #AskObama Live Twitter event, the Tweets then came up on a big Plasma screen. This tweet came up "garbled" and said:

After embarking on a record spending binge that’s left us deeper in debt, where are the jobs? #AskObama

And a million programmers, regardless of political party, groaned in unison. First, because someone screwed up their UTF-8 decoding, by not doing it, and second, because our President doesn't recognize a text encoding bug when he sees one! Well, maybe that second one was just me, but still. Tragic. The President then teased the Speaker for his typing while newspapers and news organizations struggled to get their minds around this "garbled tweet."

Well, Boehner could have tweeted "that's left us deeper..." but he tweeted "that’s." Note the "smart" apostrophe. He used Tweetdeck to tweet it, and it was likely on a Mac. It's also possible that he wrote the tweet in Microsoft Word and then copied and pasted it, as Word loves to change straight quotes and apostrophes like ' into smart quotes and smart apostrophes with direction like this ’.

I can get John Boehner's User ID (not his twitter name, but the number that represents John) with this online tool http://www.idfromuser.com. I see that it's 5357812, so I can get his timeline as RSS (Really Simple Syndication)/XML like this: http://twitter.com/statuses/user_timeline/5357812.rss or JSON (JavaScript Object Notation) like this http://twitter.com/statuses/user_timeline/5357812.json 

When I ask for this timeline, the HTTP Headers say it's encoded as "UTF-8", see?

Content-Type: application/json; charset=utf-8
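Here's a minimal sketch of what "respecting the header" looks like in Python: pull the charset out of the Content-Type value and use it to decode the raw bytes. The helper name `charset_from_content_type`, the UTF-8 fallback, and the sample `body` bytes are my own illustrative assumptions, not anything Twitter or the vendor shipped.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` (an assumption of this sketch) when the
    header does not name a charset.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


# The header value from the response above.
charset = charset_from_content_type("application/json; charset=utf-8")

# Hypothetical response bytes containing the UTF-8 form of the apostrophe.
body = b'{"text": "that\xe2\x80\x99s"}'
text = body.decode(charset)

print(charset)
print(text)
```

The point is simply that the bytes don't mean anything until you decode them with the encoding the sender declared.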

I blogged about the "Importance of being UTF-8"  about five years ago. If you look at the JSON and find the tweet with the ID 88618213008621568, you can see the raw text encoded in JSON:

"text":"After embarking on a record spending binge that\u2019s left us deeper in debt, where are the jobs?"

See that \u2019? In Windows (you have this program even if you aren't a developer) go to the Start Menu and run "Charmap." Look around and you can see U+2019 is Right Single Quotation Mark. Note that it's WAY down in the list of all the characters. It's not a basic character like A to Z or a to z. It's one of those special things that looks nice, but causes trouble later.

Character Map
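If you'd rather not hunt through Charmap, here's a small Python sketch that does the same lookup. It parses a trimmed-down copy of the JSON above (the `raw` string is my abbreviation of the real payload) and asks the standard library what that escaped character is.

```python
import json
import unicodedata

# A trimmed copy of the tweet's JSON; the \u2019 escape is kept as-is.
raw = (
    '{"text": "After embarking on a record spending binge '
    'that\\u2019s left us deeper in debt, where are the jobs?"}'
)

tweet = json.loads(raw)["text"]

# The character right after the word "that" is the smart apostrophe.
apostrophe = tweet[tweet.index("that") + len("that")]

print(f"U+{ord(apostrophe):04X}")      # U+2019
print(unicodedata.name(apostrophe))    # RIGHT SINGLE QUOTATION MARK
```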

If I make a text file in Notepad that looks like this and name it text.txt, for example, and Save As, making sure to use UTF-8 as the encoding...

After embarking on a record spending binge that’s left us deeper in debt, where are the jobs?

...then load it into any free HEX editor (or even an online one!) I see this:

The Tweet in a Hex Editor - E2, 80, and 99 are highlighted

Note that the part where the ’ was is actually three full bytes! E2 80 99.

Well, UTF-8 is an encoding whose goal was to not only support a bajillion different characters but also to be backwards compatible with ASCII, the American Standard Code for Information Interchange. If it wasn't, we wouldn't be able to see MOST of the characters in this tweet! In this case, just the ’ is goofy.

The code point was U+2019, which is 0010 0000 0001 1001, says Windows Calculator in Programmer Mode. You have this too, Dear Reader. There's some variable width encoding going on, that you can read about on Wikipedia.

This value of U+2019 expands to: 0010 0000 0001 1001, as I said, which then expands according to these rules

zzzzyyyy yyxxxxxx ->
1110zzzz
10yyyyyy
10xxxxxx

Which gives us this

11100010 -> E2
10000000 -> 80
10011001 -> 99

hence, "that’s" is encoded as

74 68 61 74 E2 80 99 73
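The bit-shuffling above can be sketched in a few lines of Python. The function name `utf8_three_byte` is mine, and it only handles the three-byte range (U+0800 through U+FFFF, minus surrogates), which is all this example needs. It's an illustration of the rule, not a replacement for a real encoder.

```python
def utf8_three_byte(code_point: int) -> bytes:
    """Encode a code point in the three-byte UTF-8 range by hand."""
    if not 0x0800 <= code_point <= 0xFFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("this sketch only handles three-byte code points")
    byte1 = 0b11100000 | (code_point >> 12)             # 1110zzzz
    byte2 = 0b10000000 | ((code_point >> 6) & 0b111111)  # 10yyyyyy
    byte3 = 0b10000000 | (code_point & 0b111111)         # 10xxxxxx
    return bytes([byte1, byte2, byte3])


encoded = utf8_three_byte(0x2019)
print(encoded.hex(" ").upper())  # E2 80 99

# Cross-check against Python's built-in encoder, for the whole word.
word_bytes = "that\u2019s".encode("utf-8")
print(word_bytes.hex(" ").upper())  # 74 68 61 74 E2 80 99 73
```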

The E2 80 99 in the middle is the ’. Read those bytes back in, this time as Extended ASCII (the ANSI Windows 1252 Code page), and the ’ expands into three characters:

that’s
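You can reproduce the bug yourself in Python: encode the text as UTF-8, then decode those same bytes with the Windows-1252 code page (Python calls it `cp1252`). The variable names are mine. The last step shows that the damage is reversible when the wrong decoding was a clean one-to-one mapping like this.

```python
original = "that\u2019s"

# What went over the wire: UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# What the display did: read those bytes as Windows-1252.
garbled = utf8_bytes.decode("cp1252")
print(garbled)  # that’s

# Undo the mistake by reversing the wrong decode, then decoding correctly.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)  # True
```

One byte became one character each time: E2 turned into â, 80 into €, and 99 into ™.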

Made it this far? Why didn't I just say "The software read in a UTF-8 encoded JSON stream of tweets and displayed it with an ANSI Windows Code Page 1252." Because that wouldn't be nearly as fun.

Either way, the company that did this for the White House definitely goofed up and should have tested this. This is SUCH a classic sloppy programmer mistake that I'm disappointed to see it showcased so blatantly. I hope they (the vendor) feel a little bad. The company appears to be called "Mass Relevance" and here's some news articles about Mass Relevance and their "Tweet Curation."

Testing, testing, testing,  my friends. And not only testing, but KNOW this stuff. They don't always teach it in schools and no one will learn until they see their bug on national TV in front of the President of the United States. ;)

UPDATE: The vendor said this in the comments. Very well said.

"It was definitely a mistake on our part. The problem was not the encoding on our data feed, but the HTML document was sent with ISO-8859-1. The second we inserted the twitter text into the DOM, the browsers interpreted the UTF-8 string as ISO-8859-1. Our visualizations are hosted on other platforms, and in this case the server was not configured to send UTF-8 with text/html even though the HTML file was encoded as such. It was the only issue (albeit a pretty obvious one) during an otherwise flawless event. I apologize to President Obama, Speaker Boehner, and Jack Dorsey for the mistake. If the readers of the blog think it was stupid, imagine how we felt. dev environment != production environment. If we would have just included a <meta charset="utf-8"> in the HTML head, then this would not have occurred.

The big take away is don’t make assumptions about other platforms (especially when it comes to encoding), and always include charset meta tag."

Mass Relevance
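For illustration, here's a minimal sketch of the fix the vendor describes, using Python's built-in `http.server`. It declares UTF-8 twice: once in the HTTP Content-Type header and once in a meta charset tag. The page content, the handler name `TweetPageHandler`, and the port are all made up for this example; this is not the vendor's actual stack.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical page; the meta tag is the belt, the HTTP header is the suspenders.
PAGE = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Tweet</title>
</head>
<body><p>that\u2019s left us deeper in debt</p></body>
</html>
"""

CONTENT_TYPE = "text/html; charset=utf-8"


class TweetPageHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = PAGE.encode("utf-8")
        self.send_response(200)
        # Tell the browser explicitly how the bytes are encoded.
        self.send_header("Content-Type", CONTENT_TYPE)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Run the demo server until interrupted (port is an arbitrary choice)."""
    HTTPServer(("127.0.0.1", port), TweetPageHandler).serve_forever()
```

Call `serve()` and load http://127.0.0.1:8000/ in a browser to see the apostrophe survive the trip.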

Text encoding is fun for all ages. Enjoy!

* Like this post? Put me on TV, folks. This is the kind of stuff that a real technology journalist *Pogue* would love to share with the people! ABC News? I'm available and I have Skype. Call my people. ;)

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

JavaScript is Assembly Language for the Web: Semantic Markup is Dead! Clean vs. Machine-coded HTML

July 6, '11 Comments [136] Posted in ASP.NET | Musings

UPDATE: Some folks think that saying "JavaScript is Assembly Language for the Web" is a totally insane statement. So, I asked a few JavaScript gurus like Brendan Eich (the inventor of JavaScript) and Douglas Crockford (inventor of JSON) and Mike Shaver (Technical VP at Mozilla). Their comments are over in this follow up blog post.

I was talking to Erik Meijer yesterday and he said:

JavaScript is an assembly language. The JavaScript + HTML generate is like a .NET assembly. The browser can execute it, but no human should really care what’s there. - Erik Meijer

This discussion started because I was playing with Google+ and, as with most websites that I'm impressed with, I immediately did a View Source to see what was underneath. I was surprised. I saw this:

That's a hell of a lot of markup

Let's just say that this went on for about 1300 lines. It is tight and about 90k. This is just the first half. It's mostly minified JavaScript. The middle part of the page is all spans and divs and generated class ids like this:

A case of the divs

Oy. The whole page is a big GUID. 

However, I see this on http://msn.com, http://www.bing.com, http://www.facebook.com and on and on. Even http://www.twitter.com is starting to "tighten" up a bit. All large sites appear to care not one bit about the aesthetics of their markup. So why do we?

It works, and it works great. Many of Google's best properties have GWT behind them. Would you be more impressed if you did a View Source and found that it was not only pretty on the outside but also inside?

This seems a little ironic because it was just a few years ago when ASP.NET Developers were railing against ViewState. "It's so heavy" really means "I don't understand what it does." ViewState was (and is) a powerful enabler for a development methodology that gets folks developing on the web faster than before. This is not unlike other toolkits like Google Web Toolkit (GWT). GWT isn't completely unlike Web Forms in its philosophy. From the GWT website:

Google Web Toolkit (GWT) is a development toolkit for building and optimizing complex browser-based applications. Its goal is to enable productive development of high-performance web applications without the developer having to be an expert in browser quirks, XMLHttpRequest, and JavaScript.

That seems like a very admirable philosophy, no? You could even say (with apologies and tongue in cheek):

"ASP.NET WebForms" is a development toolkit for building and optimizing complex browser-based applications. Its goal is to enable productive development of high-performance web applications without the developer having to be an expert in browser quirks, XMLHttpRequest, and JavaScript.

The intent of this post isn't to shine a light on WebForms or be a WebForms apologist. It's great for certain kinds of apps, just as GWT is great for certain types of apps. What I want to focus on is that working with server-side toolkits could be argued as going against the alternate philosophy that the real joy of developing on the new web comes from clean jQuery JavaScript and clean, clear markup à la Razor or HAML. It all comes down to what level of abstraction you choose to play at.

Semantic markup will still be buried in there and things like http://schema.org are still very important, just don't expect the source of your favorite website to read like a well indented haiku anymore.

To be clear, minification and compression are orthogonal optimizations. I'm talking about simply not caring if the markup and script emitted to the client are pretty. If you don't care about the markup sent to the browser, only the result, how can this free us to develop in new ways that aren't confined to slinging markup and JS? Ultimately, if it works great, who cares?

My question to you, Dear Reader, is why do you care what View Source looks like? Are HTML5 and JavaScript the new assembly language for the Web?

UPDATE for clarity:

The point is, of course, that no analogy is perfect. Of course JavaScript as a language doesn't look or act like ASM. But as an analogy, it holds up.

  • JavaScript is ubiquitous.
  • It's fast and getting faster.
  • JavaScript is as low-level as a web programming language goes.
  • You can craft it manually or you can target it by compiling from another language.

If the tools - as a developer OR a designer - give you the control and the results you want, what do you care? I propose that neither Rails, nor ASP.NET nor GWT is 100% there. Each has its issues, but I think the future of the web is a diminished focus on clean markup and instead a focus on compelling user experiences combined with languages and tools that make the developer's work enjoyable and productive.

What do you think, Dear Reader...Do you want your HTML and JavaScript abstracted away more? Or less?

UPDATE: I want to say this again to make sure folks really understand. There's two separate issues here. There's minification and general obfuscation of source, sure. But that's just the first. The real issue is JavaScript as a target language for other languages. GWT is a framework for writing Web Applications in *JAVA* where the resulting bytecode is *JAVASCRIPT.* GWT chooses a high level designed language (Java) over an organically grown one (HTML+JS) and treats the whole browser as a VM. The question - do we write assembly language or something higher level? Also, I realize now that Google+ was written with Closure, but the point remains valid.

Scott's Wonderful Twitter Favorites - Link Roundup 1

July 1, '11 Comments [9] Posted in Blogging

I realize that many (most) of you are not on Twitter. I am, however, on Twitter and I find it to be a joy. I have had a few complaints (just a few) because I tend to be random on Twitter. If you want only a stream of technical .NET resources, then don't follow me. However, if you want to follow the Whole Person, then please, join the fun.

The most wonderful part of Twitter is just letting it flow over you. I tend to discover lots of interesting and cool stuff. So much so that I've started "Favorite-ing" things I want to save for later. That means clicking the little Star icon. You can access any Twitter user's favorites by putting /favorites at the end of their URL, like http://twitter.com/shanselman/favorites. There are 3rd parties like FavStar that mine this information, both favorites (stars) and retweets (RTs) and then sort by popularity. You can see my most "popular" tweets here: http://favstar.fm/users/shanselman. These tend to be ones that folks want to save for later themselves.

For the people that aren't on Twitter (or aren't on it as much as I am) I thought I'd do a post each week or so with a roundup of the most awesome links I've come upon that week on Twitter. One stop shopping for awesome. I'll also add some commentary about why it's awesome. These aren't all development centric, but they were interesting to me.


shanselman: "I'm on Google+ on a 'Hangout' right now. It's basically ChatRoulette for nerds with invites. And pants."

I was playing with Google+ today and this was my first impression. I tweeted it and it was retweeted a few hundred times and was eventually a "Top Tweet" when you searched for "Google Plus" for several hours, which I thought was hilarious. You never know what throwaway tweet will be seen by thousands. I was happy so I favorited it.


harrison_ Can't wait to try Google+, because if anything Wave, Latitude, Buzz, Orkut, Jaiku & Dodgeball proves Google knows how to run social networks.

Twitter is a haven for sarcasm and snark. Love it.


XKCD, one of the web's most brilliant and insightful comics had this to say about Google Plus. I guess they just wanted something that wasn't Facebook. Found this via shawncdean.

On one hand, you'll never be able to convince your parents to switch. On the other hand, you'll never be able to convince your parents to switch!


clipperhouse Only just read this, stats on Fog Creek internships vs applicants: http://t.co/fCr3BH9

Some really interesting charts and graphs talking in great detail about how Fog Creek Software finds their new recruits, where they've had success and where they just haven't.

Pie Chart showing details on how Fog Creek finds recruits


miketaylr This new Google web fonts gallery is the bees knees: http://www.google.com/webfonts/v2

There are just hundreds of great free fonts here for you to use on your website.

Lots of free webfonts


shanselman: This is literally the funniest thing I've seen All Week. Tech Company Org Charts: http://twitpic.com/5ior9c

I found this on Facebook and tweeted it. As they say, "it's funny because it's true."

It was retweeted almost 900 times and the image on just my Twitpic of it has been seen over 33,000 times. Amazing. Someone named "manu" made it. If you know their site, let me know and I'll link to them!

UPDATE: Reader Matthew found Manu Cornet's blog. The original artist and source for this "Technology Company Organizational Charts" diagram blogs here!


subdigital: Ruby is slow, but at least it's open

Ben Scheirman said this and it resonated with me.  I'd like to help create something that is fast (ASP.NET) and open (???).


JosephHill: New office policy: Next time there is a fire alarm, (1) commit (2) pull (3) merge (4) push (5) exit through stairwell

Please, people, commit to source control BEFORE you save yourself from a fire! ;)


For more details on Twitter and my tips and tricks, I hope you'll enjoy these...

Related Links

Hanselminutes Podcast 272 - Basics of Web Security with Barry Dorrans

June 29, '11 Comments [0] Posted in ASP.NET | Podcast

Scott sits down with Microsoft Security Engineer Barry Dorrans to get a general sense of the basics of Web Security in 2011. Who are the groups in the news most often? What threats are nailing websites most often today, and are they different from classic threats? Where do we start to protect our sites?

Download: MP3 Full Show

NOTE: If you want to download our complete archives as a feed - that's all 271 shows, please subscribe to the Complete MP3 Feed here.

Also, please do take a moment and review the show on iTunes.

Subscribe: Subscribe to Hanselminutes or Subscribe to my Podcast in iTunes or Zune

Do also remember the complete archives are always up and they have PDF Transcripts, a little known feature that shows up a few weeks after each show.

Telerik is our sponsor for this show.

Building quality software is never easy. It requires skills and imagination. We cannot promise to improve your skills, but when it comes to User Interface and developer tools, we can provide the building blocks to take your application a step closer to your imagination. Explore the leading UI suites for ASP.NET AJAX, MVC, Silverlight, Windows Forms and WPF. Enjoy developer tools like .NET Reporting, ORM, Automated Testing Tools, Agile Project Management Tools, and Content Management Solution. And now you can increase your productivity with JustCode, Telerik’s new productivity tool for code analysis and refactoring. Visit www.telerik.com.

As I've said before this show comes to you with the audio expertise and stewardship of Carl Franklin. The name comes from Travis Illig, but the goal of the show is simple. Avoid wasting the listener's time. (and make the commute less boring)

Enjoy. Who knows what'll happen in the next show?

Using Code Signing Certificates to sign downloaded MSIs and build reputation with IE9 SmartScreen

June 27, '11 Comments [17] Posted in ASP.NET | Learning .NET | Musings

First, let me start by saying that if you want a lot of people to download something, make sure that the words "HTML5," "Support" and "Update" appear in the title. I'm sure if the folks that are making Diablo 3 called it "Diablo 3 HTML5 Support Update" that a metric buttload more people would download it.

That said, a bunch of folks in the Web Platform and Tools team created the Web Standards Update package with HTML5 Support for the Visual Studio 2010 Editor.

This Web Standards Update is something that anyone in the community could have released, just extending Visual Studio in a standard way. Like many other (most) extensions in Visual Studio Extension Gallery, it was not "signed." It was not a formal project done by Microsoft. Rather, this was something that a bunch of us did for the community in our after-work hours. The only reason this got in the spotlight was because the press caught wind of it having HTML5 and CSS3 support.

Certainly a lot of people wanted it because in 4 days it's now the #1 most popular thing in the Visual Studio Gallery. Take that NuGet! ;)

Here's where the trouble starts. Then, it was written about in the press as if it were a "gaffe." I admit that we (mostly I) did a mediocre job of making it clear that this update was a "community update from the inside," as it were. It's not official, but we're hoping support like this will make its way into the next version of Visual Studio.

When you downloaded the MSI installer with IE9, as with all MSIs that aren't signed, you get a message like this:

Do you want to Run or Save this MSI?

And that's normal and quite lovely. Then we see this scary red bar (this is a shot from another gallery item):

SmartScreen Red (BAD) Bar

This is the IE9 SmartScreen system warning us, rightfully so, that this is not something downloaded all the time. In fact, this is a really useful feature of IE9 and is fairly unique amongst the browsers so far. It's using some special sauce (some hash, some math, some metrics) to make a non-biased judgment about this download. Even though it's coming from a Microsoft.com website it doesn't matter. SmartScreen is unbiased. It's never seen this before, and it's not trusted.

UPDATE: Looks like as of my test just now that SmartScreen now recognizes our download as safe!

At this point, if I click Actions, I see this. (Yes, I realize these screenshots aren't all up to snuff).

In fact, for most people, they can't even click "Run Anyway" yet. They'll have to click More Options to see the Run Anyway button. (If I am a developer-type and click More Options all the time, presumably I either know what I'm doing, or I like to live dangerously and the More Options choice will stick open after several downloads. It'll save me a click, but all the other warnings remain.)

As the publisher, we have a few choices. We could sign the binary file (the MSI) with the Microsoft code certificate. However, that requires a big manager to sign off and says explicitly that Microsoft is releasing this code officially. It's a big deal. This wasn’t an official release and as such, we can't sign it as Microsoft. A code signing certificate guarantees that a file hasn't been tampered with and that a known and verified organization or individual stands behind it.

Eventually SmartScreen would figure out that our MSI was OK, but we have no way of telling how long that would take. Could be weeks, months, it all depends. Regardless, the right thing to do is to sign your code, even if you are an individual or small company. For example, if I download Eric Lawrence's Fiddler or Rick Brewster's Paint.NET, they are both signed and I can see their names in the User Account Control (UAC) dialog. I can click and view their certificates and know I'm downloading a file that has someone vouching for it.

Be sure to check out Eric Lawrence's excellent post on Authenticode Code Signing. It's extremely detailed and worth your time.

Getting a Code Signing Certificate

I got a Code Signing Certificate from InstantSSL.com. There's many options, they are one. It's spendy, $180 a year, or $166 a year if you go for 3 years, but I can use it for other stuff.

There's a few gotchas in the process, no matter who you pick.

  • Use the same computer, same OS, and same browser (preferably IE, for this, no joke) when you sign up for the certificate. That's because half the certificate (a cert request cert) comes down when you request a certificate and they match them up when you actually get the certificate.
  • Have a P.O. Box or corporate address, or ask them via tech support to remove your address. Otherwise your full details may get embedded in the cert.
  • You'll need to prove who you are. More on that now.

You'll need to prove you are really you. I needed to give their verification people a copy of the first page of my passport, driver's license, two utility bills, including phone whose address matched my credit card's address, AND they called the phone number on my utility bill to confirm it was really me. It's non-trivial, it takes a while, and they aren't screwing around. Good for you, the consumer, hassle for me, the producer. Still, good stuff.

Certificate Manager with my new Cert

When my cert shows up, I need to Export it and save it in a safe place with all its details and a strong password. It's unique and should be protected.

Signing Code

The actual signing, once the cert shows up is not too hard. Here's a command line used with the signtool.exe that came with Visual Studio. You can also download it separately.

C:\DEV> signtool sign /t http://timestamp.comodoca.com/authenticode /f "C:\DEV\HanselmanCODESIGNINGCERT.pfx" /p SecretPassword '.\MySpecial.msi'
Done Adding Additional Store
Successfully signed and timestamped: .\MySpecial.msi
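If you sign as part of a build, you probably don't want the password sitting in a script. Here's a rough Python sketch that builds the same signtool command line, using only the /t, /f and /p switches shown above, and reads the password from an environment variable. The function names and the variable name `SIGNING_PFX_PASSWORD` are hypothetical, and the password is still passed to signtool on its command line, so treat this as a starting point rather than a hardened setup.

```python
import os
import subprocess

# Timestamp URL taken from the command above.
TIMESTAMP_URL = "http://timestamp.comodoca.com/authenticode"


def build_sign_command(msi_path: str, pfx_path: str, password: str) -> list[str]:
    """Build the signtool argument list (same switches as the command above)."""
    return [
        "signtool", "sign",
        "/t", TIMESTAMP_URL,
        "/f", pfx_path,
        "/p", password,
        msi_path,
    ]


def sign(msi_path: str, pfx_path: str) -> None:
    """Sign an MSI, reading the PFX password from a hypothetical env variable."""
    password = os.environ["SIGNING_PFX_PASSWORD"]
    # check=True raises CalledProcessError if signtool reports failure.
    subprocess.run(build_sign_command(msi_path, pfx_path, password), check=True)
```

Usage would look like `sign(r".\MySpecial.msi", r"C:\DEV\HanselmanCODESIGNINGCERT.pfx")` on a machine where signtool.exe is on the PATH.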

When someone tries to download the new signed MSI, they see this slightly less scary yellow bar. What? I don't get a free pass for signing my code?

SmartScreen Yellow Bar

Well, just like getting an SSL certificate doesn't make me a bank, getting a Code Signing Certificate doesn't make me more trustworthy.

  • SSL Certificates for HTTPS guarantee privacy, not trust.
  • Code Signing Certificates guarantee identity, not trust.
    • It guarantees it's me, but you have to decide if you trust me.

If you click Actions now, you'll see my name as the Publisher, and you can validate the certificate and decide if you trust me. But SmartScreen doesn't trust me yet. Why?

 My code signing certificate in the Run Dialog

That's because my Certificate, unlike the Microsoft one, hasn't built up a reputation*. The "Scott Hanselman" code signing cert will have to earn trust, just as Rick Brewster and Eric Lawrence and every other signed shareware or freeware author has built trust. But, having this MSI signed means you know that I (and Mads, and Vishal, and the folks working on this MSI) stand behind it. Hopefully soon (some # days or weeks vs. downloads?) SmartScreen will trust us also, and this will make future projects I sign be trusted faster. At that point, my signed code will be trusted and SmartScreen won't frighten you with this download.

Remember also that code signing certificates and the Windows experience and UI for running signed MSIs and EXEs are separate from SmartScreen. They work together and complement each other, though. Learn more about SmartScreen on their team blog or their FAQ.

Hope this helps! Surf smart, and think about what you download and who you trust.

* Now it appears that SmartScreen trusts me!

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.