Scott Hanselman

Why the #AskObama Tweet was Garbled on Screen: Know your UTF-8, Unicode, ASCII and ANSI Decoding Mr. President

July 7, '11 Comments [54] Posted in Internationalization | Musings

The Washington Post and the Garbled Tweet

UPDATE: The contractor/vendor that made the software commented on Hacker News with more technical information. They're a very classy shop and have handled this REALLY minor gaffe very well, to their credit. I mean, let's put this into perspective, it's a fun nit, it's a weird thing that only we programmers understand, but ultimately what we can all agree on is Obama should outlaw Smart Quotes immediately.

The Speaker of the House of Representatives John Boehner tweeted this a few days ago. Note that this is not a political blog post.

After embarking on a record spending binge that’s left us deeper in debt, where are the jobs? #AskObama

During the #AskObama Live Twitter event, the Tweets then came up on a big Plasma screen. This tweet came up "garbled" and said:

After embarking on a record spending binge thatâ€™s left us deeper in debt, where are the jobs? #AskObama

And a million programmers, regardless of political party, groaned in unison. First, because someone screwed up their UTF-8 decoding, by not doing it, and second, because our President doesn't recognize a text encoding bug when he sees one! Well, maybe that second one was just me, but still. Tragic. The President then teased the Speaker for his typing while newspapers and news organizations struggled to get their minds around this "garbled tweet."

Well, Boehner could have tweeted "that's left us deeper..." but he tweeted "that’s." Note the "smart" apostrophe. He used Tweetdeck to tweet it, and it was likely on a Mac. It's also possible that he wrote the tweet in Microsoft Word then copy pasted it as Word loves to change quotes and apostrophes ' into smart quotes and smart apostrophes with direction like this ’.

I can get John Boehner's User ID (not his twitter name, but the number that represents John) with this online tool http://www.idfromuser.com. I see that it's 5357812, so I can get his timeline as RSS (Really Simple Syndication)/XML like this: http://twitter.com/statuses/user_timeline/5357812.rss or JSON (JavaScript Object Notation) like this http://twitter.com/statuses/user_timeline/5357812.json 

When I ask for this timeline, the HTTP Headers say it's encoded as "UTF-8", see?

Content-Type: application/json; charset=utf-8

I blogged about the "Importance of being UTF-8"  about five years ago. If you look at the JSON and find the tweet with the ID 88618213008621568, you can see the raw text encoded in JSON:

"text":"After embarking on a record spending binge that\u2019s left us deeper in debt, where are the jobs?"

See that \u2019? In Windows (you have this program even if you aren't a developer) go to the Start Menu and run "Charmap." Look around and you can see U+2019 is Right Single Quotation Mark. Note that it's WAY down in the list of all the characters. It's not a basic character like A to Z or a to z. It's one of those special things that looks nice, but causes trouble later.

Character Map

If I make a text file in Notepad that looks like this and name it text.txt, for example, and Save As, making sure to use UTF-8 as the encoding...

After embarking on a record spending binge that’s left us deeper in debt, where are the jobs?

...then load it into any free HEX editor (or even an online one!) I see this:

The Tweet in a Hex Editor - E2, 80, and 99 are highlighted

Note that the part where the ’ was is actually three full bytes! E2 80 99.

Well, UTF-8 is an encoding whose goal was to not only support a bajillion different characters but also to be backwards compatible with ASCII, the American Standard Code for Information Interchange. If it wasn't, we wouldn't be able to see MOST of the characters in this tweet! In this case, just the ’ is goofy.

The code point was U+2019, which is 0010 0000 0001 1001, says Windows Calculator in Programmer Mode. You have this too, Dear Reader. There's some variable width encoding going on, that you can read about on Wikipedia.

This value of U+2019 expands to: 0010 0000 0001 1001, as I said, which then expands according to these rules:

zzzzyyyy yyxxxxxx ->
1110zzzz
10yyyyyy
10xxxxxx

Which gives us this

11100010 -> E2
10000000 -> 80
10011001 -> 99

hence, "that’s" is encoded as

74 68 61 74 E2 80 99 73
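If you'd rather see that bit-shuffling as code, here's a minimal sketch. The function names are made up for illustration, and it only handles the three-byte case described above (U+0800 through U+FFFF), not the full UTF-8 algorithm.

```javascript
// Encode a code point in the range U+0800..U+FFFF as three UTF-8 bytes,
// following the zzzzyyyy yyxxxxxx -> 1110zzzz 10yyyyyy 10xxxxxx rule.
// (This sketch does not reject surrogates, U+D800..U+DFFF.)
function encodeThreeByteUtf8(codePoint) {
  if (codePoint < 0x0800 || codePoint > 0xffff) {
    throw new RangeError("this sketch only handles U+0800..U+FFFF");
  }
  return [
    0xe0 | (codePoint >> 12),         // 1110zzzz: top 4 bits
    0x80 | ((codePoint >> 6) & 0x3f), // 10yyyyyy: middle 6 bits
    0x80 | (codePoint & 0x3f),        // 10xxxxxx: low 6 bits
  ];
}

// Format bytes as upper-case, space-separated hex.
function toHex(bytes) {
  return bytes.map(b => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

console.log(toHex(encodeThreeByteUtf8(0x2019))); // E2 80 99
```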

The three bytes E2 80 99 are the ’. Read back in as Extended ASCII (the ANSI Windows-1252 code page) instead of UTF-8, each of those bytes becomes its own character and the ’ expands into â€™:

thatâ€™s
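Here's a small sketch of that misreading. The lookup table is deliberately partial: it only contains the two Windows-1252 entries this example needs (0x80 and 0x99), and the function name is hypothetical. Every other byte is treated as its Latin-1 equivalent.

```javascript
// Partial Windows-1252 table: only the two bytes in 0x80..0x9F that this
// example needs. A real decoder would cover the whole range.
const CP1252_EXTRAS = { 0x80: "\u20AC" /* € */, 0x99: "\u2122" /* ™ */ };

// Decode bytes one at a time, the way a single-byte code page does.
function decodeAsCp1252(bytes) {
  return bytes
    .map(b => CP1252_EXTRAS[b] || String.fromCharCode(b))
    .join("");
}

// The UTF-8 bytes for "that’s" from the hex dump above.
const utf8Bytes = [0x74, 0x68, 0x61, 0x74, 0xe2, 0x80, 0x99, 0x73];

console.log(decodeAsCp1252(utf8Bytes)); // thatâ€™s
```

One character in, three characters out. Nothing was corrupted on the wire; the bytes were simply read with the wrong decoder.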

Made it this far? Why didn't I just say "The software read in a UTF-8 encoded JSON stream of tweets and displayed it with an ANSI Windows Code Page 1252." Because that wouldn't be nearly as fun.

Either way, the company that did this for the White House definitely goofed up and should have tested this. This is SUCH a classic sloppy programmer mistake that I'm disappointed to see it showcased so blatantly. I hope they (the vendor) feel a little bad. The company appears to be called "Mass Relevance" and here's some news articles about Mass Relevance and their "Tweet Curation."

Testing, testing, testing,  my friends. And not only testing, but KNOW this stuff. They don't always teach it in schools and no one will learn until they see their bug on national TV in front of the President of the United States. ;)

UPDATE: The vendor said this in the comments. Very well said.

"It was definitely a mistake on our part. The problem was not the encoding on our data feed, but the HTML document was sent with ISO-8859-1. The second we inserted the twitter text into the DOM, the browsers interpreted the UTF-8 string as ISO-8859-1. Our visualizations are hosted on other platforms, and in this case the server was not configured to send UTF-8 with text/html even though the HTML file was encoded as such. It was the only issue (albeit a pretty obvious one) during an otherwise flawless event. I apologize to President Obama, Speaker Boehner, and Jack Dorsey for the mistake. If the readers of the blog think it was stupid, imagine how we felt. dev environment != production environment. If we would have just included a <meta charset="utf-8"> in the HTML head, then this would not have occurred.

The big take away is don’t make assumptions about other platforms (especially when it comes to encoding), and always include charset meta tag."

Mass Relevance

Text encoding is fun for all ages. Enjoy!

* Like this post? Put me on TV, folks. This is the kind of stuff that a real technology journalist *Pogue* would love to share with the people! ABC News? I'm available and I have Skype. Call my people. ;)

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. I am a failed stand-up comic, a cornrower, and a book author.


Globalization, Internationalization and Localization in ASP.NET MVC 3, JavaScript and jQuery - Part 1

May 26, '11 Comments [35] Posted in ASP.NET | ASP.NET MVC | Internationalization | Javascript

There are several books worth of information to be said about Internationalization (i18n) out there, so I can't solve it all in a blog post. Even 9 pages of blog posts. I like to call it Iñtërnâtiônàlizætiøn, actually.

There are a couple of basic things to understand, though, before you create a multilingual ASP.NET application. Let's agree on some basic definitions as these terms are often used interchangeably.

  • Internationalization (i18n) - Making your application able to support a range of languages and locales
  • Localization (L10n) - Making your application support a specific language/locale.
  • Globalization - The combination of Internationalization and Localization
  • Language - For example, Spanish generally. ISO code "es"
  • Locale - Mexico. Note that Spanish in Spain is not the same as Spanish in Mexico, e.g. "es-ES" vs. "es-MX"

Culture and UICulture

The User Interface Culture is a CultureInfo instance from the .NET base class library (BCL). It lives on Thread.CurrentThread.CurrentUICulture and if you felt like it, you could set it manually like this:

Thread.CurrentThread.CurrentUICulture = new CultureInfo("es-MX");

The CurrentCulture is used for Dates, Currency, etc.

Thread.CurrentThread.CurrentCulture = new CultureInfo("es-MX"); 

However, you really ought to avoid doing this kind of stuff unless you know what you're doing and you really have a good reason.

The user's browser will report their language preferences in the Accept-Language HTTP Header like this:

GET http://www.hanselman.com HTTP/1.1
Connection: keep-alive
Cache-Control: max-age=0
Accept-Encoding: gzip,deflate,sdch
Accept-Language: en-US,en;q=0.8

See how I prefer en-US and then en? I can get ASP.NET to automatically pass those values and set up the threads with the correct culture. I need to set my web.config like this:



<system.web>
    <globalization culture="auto" uiCulture="auto" />
    ...snip...

That one line will do the work for me. At this point the current thread and current UI thread's culture will be automatically set by ASP.NET.

The Importance of Pseudointernationalization

Back in 2005 I updated John Robbins' Pseudoizer (and misspelled it then!) and I've just ported it over to .NET 4 and used it for this application. I find this technique for creating localizable sites really convenient because I'm effectively changing all the strings within my app to another language, which lets me spot strings I missed without the tedium of actually translating them.

You can download the .NET Pseudoizer here.

UPDATE: I've put the source for Pseudoizer up on GitHub. You are welcome to fork/clone it and send pull requests or make your own versions.

Here's an example from that earlier post before I run it through Pseudointernationalization:


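<data name="Accounts.Download.Title">
  <value>Transaction Download</value>
</data>
<data name="Accounts.Statements.Action.ViewStatement">
  <value>View Statement</value>
</data>
<data name="Accounts.Statements.Instructions">
  <value>Select an account below to view or download your available online statements.</value>
</data>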

I can convert these resources with the pseudoizer like this:

PsuedoizerConsole examplestrings.en.resx examplestrings.xx.resx

and here's the result:


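<data name="Accounts.Download.Title">
  <value>[Ŧřäʼnşäčŧįőʼn Đőŵʼnľőäđ !!! !!!]</value>
</data>
<data name="Accounts.Statements.Action.ViewStatement">
  <value>[Vįęŵ Ŝŧäŧęmęʼnŧ !!! !!!]</value>
</data>
<data name="Accounts.Statements.Instructions">
  <value>[Ŝęľęčŧ äʼn äččőūʼnŧ þęľőŵ ŧő vįęŵ őř đőŵʼnľőäđ yőūř äväįľäþľę őʼnľįʼnę şŧäŧęmęʼnŧş. !!! !!! !!! !!! !!!]</value>
</data>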

Cool, eh? If you're working with RESX files a lot, be sure to familiarize yourself with the resgen.exe command-line tool that is included with Visual Studio and the .NET SDK. You have this on your system already. You can move easily between the RESX XML-based file format and a more human- (and translator-) friendly text name=value format like this:

resgen /compile examplestrings.xx.resx,examplestrings.xx.txt

And now they are a nice name=value format, and as I said, I can move between them.

Accounts.Download.Title=[Ŧřäʼnşäčŧįőʼn Đőŵʼnľőäđ !!! !!!]
Accounts.Statements.Action.ViewStatement=[Vįęŵ Ŝŧäŧęmęʼnŧ !!! !!!]
Accounts.Statements.Instructions=[Ŝęľęčŧ äʼn äččőūʼnŧ þęľőŵ ŧő vįęŵ őř đőŵʼnľőäđ yőūř äväįľäþľę őʼnľįʼnę şŧäŧęmęʼnŧş. !!! !!! !!! !!! !!!]

During development I like to add this Pseudoizer step to my Continuous Integration build or as a pre-build step and assign the resources to a random language I'm NOT going to be creating, like Polish (with all due respect to the Poles). So I'd make examplestrings.pl.resx, and then we can test our fake language by changing our browser's UserLanguages to prefer pl-PL over en-US.
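For a feel of what a pseudoizer does to a string, here's a rough sketch. The character substitutions are read off the sample output above, but the padding rule (about 30% extra, in " !!!" chunks) is purely my assumption and not the actual Pseudoizer logic, and the function name is hypothetical.

```javascript
// Character substitutions taken from the pseudoized strings shown above.
const PSEUDO_MAP = {
  a: "ä", b: "þ", c: "č", d: "đ", e: "ę", i: "į", l: "ľ", n: "ʼn", o: "ő",
  r: "ř", s: "ş", t: "ŧ", u: "ū", w: "ŵ", D: "Đ", S: "Ŝ", T: "Ŧ",
};

// Assumed padding rule (not necessarily Pseudoizer's): roughly 30% extra
// length, added as " !!!" chunks, with at least one chunk.
function pseudoize(text) {
  const swapped = Array.from(text, ch => PSEUDO_MAP[ch] || ch).join("");
  const chunks = Math.max(1, Math.ceil((text.length * 0.3) / 4));
  return "[" + swapped + " !!!".repeat(chunks) + "]";
}

console.log(pseudoize("View Statement")); // [Vįęŵ Ŝŧäŧęmęʼnŧ !!! !!!]
```

The brackets make truncated strings easy to spot in the UI, and any plain English text that shows up on the page is a string that never went through the resource file.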

Localization Fallback

Different languages take different amounts of space. God bless the Germans but their strings will take an average of 30% more space than English phrases. Chinese will take 30% less. The Pseudoizer pads strings in order to illustrate these differences and encourage you to take them into consideration in your layouts.

Localization within .NET (not specific to ASP.NET Proper or ASP.NET MVC) implements a standard fallback mechanism. That means it will start looking for the most specific string from the required locale, then fall back, continuing to look until it ends on the neutral language (whatever that is). This fallback is handled by convention-based naming. Here is an older, but still excellent live demo of Resource Fallback at ASPAlliance.

For example, let's say there are three resources. Resources.resx, Resources.es.resx, and Resources.es-MX.resx.

Resources.resx:
HelloString=Hello, what's up?
GoodbyeString=See ya!
DudeString=Duuuude!

Resources.es.resx:
HelloString=¿Cómo está?
GoodbyeString=Adiós!

Resources.es-MX.resx:
HelloString=¿Hola, qué tal?

Consider these three files in a fallback scenario. The user shows up with his browser requesting es-MX. If we ask for HelloString, he'll get the most specific one. If we ask for GoodbyeString, we have no "es-MX" equivalent, so we move up one to just "es." If we ask for DudeString, we have no es strings at all, so we'll fall all the way back to the neutral resource.
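The lookup order is easy to sketch in a few lines. This is an illustration of the idea only, not how the .NET ResourceManager is implemented; the `resources` object and both function names are hypothetical, with the neutral resource stored under the empty string.

```javascript
// Resource tables mirroring the three example files above.
const resources = {
  "": { HelloString: "Hello, what's up?", GoodbyeString: "See ya!", DudeString: "Duuuude!" },
  "es": { HelloString: "¿Cómo está?", GoodbyeString: "Adiós!" },
  "es-MX": { HelloString: "¿Hola, qué tal?" },
};

// Build the lookup order for a culture name: "es-MX" -> ["es-MX", "es", ""].
function fallbackChain(culture) {
  const chain = [];
  let current = culture;
  while (current) {
    chain.push(current);
    const dash = current.lastIndexOf("-");
    current = dash === -1 ? "" : current.slice(0, dash);
  }
  chain.push(""); // the neutral resource is always last
  return chain;
}

// Return the most specific string available for the culture.
function lookup(key, culture) {
  for (const name of fallbackChain(culture)) {
    const table = resources[name];
    if (table && Object.prototype.hasOwnProperty.call(table, key)) {
      return table[key];
    }
  }
  return undefined;
}

console.log(lookup("HelloString", "es-MX"));   // ¿Hola, qué tal?
console.log(lookup("GoodbyeString", "es-MX")); // Adiós!
console.log(lookup("DudeString", "es-MX"));    // Duuuude!
```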

Using this basic concept of fallback, you can minimize the number of strings you localize and provide users with not only language-specific strings (Spanish) but also local (Mexican Spanish) strings. And yes, I realize this is a silly example and isn't really representative of Spaniards or Mexican colloquial language.

Views rather than Resources

If you don't like the idea of resources (though you'll still have to deal with some), you could also have different views for different languages and locales. You can structure your ~/Views folders like Brian Reiter and others have. It's actually pretty obvious once you have bought into the idea of resource fallback as above. Here's Brian's example:

/Views
    /Globalization
        /ar
            /Home
                /Index.aspx
            /Shared
                /Site.master
                /Navigation.aspx
        /es
            /Home
                /Index.aspx
            /Shared
                /Navigation.aspx
        /fr
            /Home
                /Index.aspx
            /Shared
    /Home
        /Index.aspx
    /Shared
        /Error.aspx
        /Footer.aspx
        /Navigation.aspx
        /Site.master

Just as you can let ASP.NET change the current UI culture based on UserLanguages or a cookie, you can also control the way that Views are selected by a small override of your favorite ViewEngine. Brian includes a few lines to pick views based on a language cookie on his blog.

He also includes some simple jQuery to allow a user to override their language with a cookie like this:

var mySiteNamespace = {};

mySiteNamespace.switchLanguage = function (lang) {
    $.cookie('language', lang);
    window.location.reload();
};

$(document).ready(function () {
    // attach mySiteNamespace.switchLanguage to click events based on css classes
    $('.lang-english').click(function () { mySiteNamespace.switchLanguage('en'); });
    $('.lang-french').click(function () { mySiteNamespace.switchLanguage('fr'); });
    $('.lang-arabic').click(function () { mySiteNamespace.switchLanguage('ar'); });
    $('.lang-spanish').click(function () { mySiteNamespace.switchLanguage('es'); });
});

I'd probably make this a single client event and use data-language or an HTML5 attribute (brainstorming) like this:

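$(document).ready(function () {
    $('.language').click(function () {
        // Read the data-language attribute of the clicked element
        $.cookie('language', $(this).data('language'));
        // Reload so the server sees the new cookie
        window.location.reload();
    });
});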

But you get the idea. You can set override cookies, check those first, then check the UserLanguages header. It depends on the experience you're looking for, and you need to hook it up between the client and server.
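That "cookie first, then header" order might look something like this. It's a sketch under assumptions: `pickLanguage`, the `supported` list, and the `fallback` value are all hypothetical, and it trusts the header to list languages in preference order rather than sorting by q weights.

```javascript
// Choose a language: an explicit cookie wins, then the Accept-Language
// header in the order the browser sent it, then a default.
// `supported` and `fallback` are hypothetical app settings.
function pickLanguage(cookieValue, acceptLanguage, supported, fallback) {
  const candidates = [];
  if (cookieValue) candidates.push(cookieValue);
  for (const part of (acceptLanguage || "").split(",")) {
    const tag = part.split(";")[0].trim(); // drop any ;q= weight
    if (tag) candidates.push(tag);
  }
  for (const candidate of candidates) {
    const lower = candidate.toLowerCase();
    // Try the full tag ("es-mx"), then just the language part ("es").
    for (const option of [lower, lower.split("-")[0]]) {
      if (supported.includes(option)) return option;
    }
  }
  return fallback;
}

const supported = ["en", "fr", "ar", "es"];
console.log(pickLanguage("fr", "en-US,en;q=0.8", supported, "en")); // fr
console.log(pickLanguage(null, "es-MX,es;q=0.8", supported, "en")); // es
```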

Globalized JavaScript Validation

If you're doing a lot of client-side work using JavaScript and jQuery, you'll need to get familiar with the jQuery Global plugin. You may also want the localization files for things like the DatePicker and jQuery UI on NuGet via "install-package jQuery.UI.i18n."

Turns out the one thing you can't ask your browser via JavaScript is what languages it prefers. That is sitting inside an HTTP Header called "Accept-Language" and looks like this, as it's a weighted list.

en-ca,en;q=0.8,en-us;q=0.6,de-de;q=0.4,de;q=0.2
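To make "weighted list" concrete, here's a minimal parser sketch. The function name is made up, and it's deliberately simple: a tag with no q gets weight 1, ties keep the header's order, and it doesn't treat q=0 ("not acceptable") specially.

```javascript
// Parse an Accept-Language header value into language tags ordered by preference.
function parseAcceptLanguage(header) {
  return header
    .split(",")
    .map((part, index) => {
      const [tag, ...params] = part.trim().split(";");
      const qParam = params.map(p => p.trim()).find(p => p.startsWith("q="));
      const q = qParam ? parseFloat(qParam.slice(2)) : 1; // no q means 1.0
      return { tag: tag.trim(), q: isNaN(q) ? 0 : q, index };
    })
    .filter(entry => entry.tag !== "")
    .sort((a, b) => b.q - a.q || a.index - b.index) // keep header order on ties
    .map(entry => entry.tag);
}

console.log(parseAcceptLanguage("en-ca,en;q=0.8,en-us;q=0.6,de-de;q=0.4,de;q=0.2"));
// [ 'en-ca', 'en', 'en-us', 'de-de', 'de' ]
```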

We want to tell jQuery and friends about this value, so we need access to it from the client side in a different way, so I propose this.

This is Cheesy - use Ajax

We could do this, with a simple controller on the server side:

public class LocaleController : Controller {
    public ActionResult CurrentCulture() {
        // Thread.CurrentThread (not Thread.Current) is the running thread
        return Json(System.Threading.Thread.CurrentThread.CurrentUICulture.ToString(), JsonRequestBehavior.AllowGet);
    }
}

And then call it from the client side. Ask jQuery to figure it out, and be sure you have the client side globalization libraries you want for the cultures you'll support. I downloaded all 700 jQuery Globs from GitHub. Then I could make a quick Ajax call and get that info dynamically from the server. I also include the locales I want to support as scripts like  /Scripts/globinfo/jquery.glob.fr.js. You could also build a dynamic parser and load these dynamically also, or load them ALL when they show up on the Google or Microsoft CDNs as a complete blob.

But that is a little cheesy because I have to make that little JSON call. Perhaps this belongs somewhere else, like a custom META tag.

Slightly Less Cheesy - Meta Tag

Why not put the value of this header in a META tag on the page and access it there? It means no extra AJAX call and I can still use jQuery as before. I'll create an HTML helper and use it in my main layout page. Here's the HTML Helper. It uses the current thread, which was automatically set earlier by the setting we added to the web.config.

namespace System.Web.Mvc
{
    public static class LocalizationHelpers
    {
        public static IHtmlString MetaAcceptLanguage(this HtmlHelper html)
        {
            var acceptLanguage = HttpUtility.HtmlAttributeEncode(Threading.Thread.CurrentThread.CurrentUICulture.ToString());
            return new HtmlString(String.Format("<meta name=\"accept-language\" content=\"{0}\">", acceptLanguage));
        }
    }
}

I use this helper like this on the main layout page:

@Html.MetaAcceptLanguage()

...

And the resulting HTML looks like this. Note that this made-up META tag would be semantically different from the Content-Language or the lang= attributes as it's part of the parsed HTTP Header that ASP.NET decided was our current culture, moved into the client.

<meta name="accept-language" content="en-US">

Now I can access it with similar code from the client side. I hope to improve this and support dynamic loading of the JS, however preferCulture isn't smart and actually NEEDS the resources loaded in order to make a decision. I would like a method that would tell me the preferred culture so that I might load the resources on-demand.

So what? Now when I am on the client side, my validation and JavaScript is a little smarter. Once jQuery on the client knows about your current preferred culture, you can start being smart with your jQuery. Make sure you are moving around non-culture-specific data values on the wire, then convert them as they become visible to the user.

var price = $.format(123.789, "c");
jQuery("#price").html(price);
var date = $.format(new Date(1972, 2, 5), "D");
jQuery("#date").html(date);
var units = $.format(12345, "n0");
jQuery("#unitsMoved").html(units);

Now, you can apply these concepts to validation within ASP.NET MVC.

Globalized jQuery Unobtrusive Validation 

Adding onto the code above, we can hook up the globalization to validation, so that we'll better understand how to manage values like 5,50 which is 5.50 for the French, for example. There are a number of validation methods you can hook up, here's number parsing.

$(document).ready(function () {
    //Ask ASP.NET what culture we prefer, because we stuck it in a meta tag
    var data = $("meta[name='accept-language']").attr("content");
    //Tell jQuery to figure it out also on the client side.
    $.global.preferCulture(data);

    //Tell the validator, for example,
    // that we want numbers parsed a certain way!
    $.validator.methods.number = function (value, element) {
        //Compare against NaN so that a valid 0 isn't rejected as falsy
        return this.optional(element) || !isNaN($.global.parseFloat(value));
    };
});
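To show what culture-aware parsing means without pulling in the plugin, here's a tiny standalone sketch. It is not the jQuery Global implementation; `parseLocalizedFloat` is a hypothetical helper, and the separators are passed in by hand instead of coming from culture data.

```javascript
// Parse a number typed with culture-specific separators. Returns NaN when
// the text does not match the expected format.
function parseLocalizedFloat(text, decimalSeparator, groupSeparator) {
  // Drop grouping separators: "1 234,5" -> "1234,5"
  let normalized = text.trim().split(groupSeparator).join("");
  if (decimalSeparator !== ".") {
    if (normalized.includes(".")) return NaN; // "." is not a decimal point here
    normalized = normalized.replace(decimalSeparator, ".");
  }
  return /^-?\d+(\.\d+)?$/.test(normalized) ? parseFloat(normalized) : NaN;
}

// French-style separators (assumed here as "," and a space)
console.log(parseLocalizedFloat("5,50", ",", " ")); // 5.5
console.log(parseLocalizedFloat("5.50", ",", " ")); // NaN
// English-style separators
console.log(parseLocalizedFloat("1,234.5", ".", ",")); // 1234.5
```

Note that "0" parses to 0, which is a perfectly valid number but falsy in JavaScript, so a validator should test the result with isNaN rather than for truthiness.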

If I set my User Languages to prefer French (fr-FR) as in this screenshot:

Language Preference Dialog preferring French

Then my validation realizes that and won't allow 5.50 as a value, but will allow 5,50, given this model:

public class Example
{
    public int ID { get; set; }
    [Required]
    [StringLength(30)]
    public string First { get; set; }
    [Required]
    [StringLength(30)]
    public string Last { get; set; }
    [Required]
    public DateTime BirthDate { get; set; }
    [Required]
    [Range(0,100)]
    public float HourlyRate { get; set; }
}

I'll see this validation error, as the client side knows our preference for , as a decimal separator.

NOTE: It seems to me that the [Range] attribute that talks to jQuery Validation doesn't support globalization and isn't calling into the localized methods so it won't work with the , and . decimal problem. I was able to fix this problem by overriding the range method in jQuery like this, forcing it to use the global implementation of parseFloat. Thanks to Kostas in the comments on this post for this info.

jQuery.extend(jQuery.validator.methods, {
    range: function (value, element, param) {
        //Use the Globalization plugin to parse the value
        var val = $.global.parseFloat(value);
        return this.optional(element) || (val >= param[0] && val <= param[1]);
    }
});

Here it is working with validity...

The Value 4.5 is not valid for Hourly Rate

And here it is in a Danish culture working with [range]:

Localized Range

 

I can also set the Required Attribute to use specific resources and names and localize them from an ExampleResources.resx file like this:

public class Example
{
    public int ID { get; set; }
    [Required(ErrorMessageResourceType=typeof(ExampleResources),
              ErrorMessageResourceName="RequiredPropertyValue")]
    [StringLength(30)]
    public string First { get; set; }
    ...snip...

And see this:

image

NOTE: I'm looking into how to set new defaults for all fields, rather than overriding them individually. I've been able to override some with a resource file that has keys called "PropertyValueInvalid" and "PropertyValueRequired" then setting these values in the Global.asax, but something isn't right.

DefaultModelBinder.ResourceClassKey = "ExampleResources";
ValidationExtensions.ResourceClassKey = "ExampleResources";

I'll continue to explore this.

Dynamically Localizing the jQuery DatePicker

Since I know what the current jQuery UI culture is, I can use it to dynamically load the resources I need for the DatePicker. I've installed the "MvcHtml5Templates" NuGet library from Scott Kirkland so my input type is "datetime" and I've added this little bit of JavaScript that says, do we support dates? Are we non-English? If so, go get the right DatePicker script and set its info as the default for our DatePicker by getting the regional settings given the current global culture.

//Setup datepickers if we don't support it natively!
if (!Modernizr.inputtypes.date) {
    if ($.global.culture.name != "en-us" && $.global.culture.name != "en") {
        var datepickerScriptFile = "/Scripts/globdatepicker/jquery.ui.datepicker-" + $.global.culture.name + ".js";
        //Now, load the date picker support for this language
        // and set the defaults for a localized calendar
        $.getScript(datepickerScriptFile, function () {
            $.datepicker.setDefaults($.datepicker.regional[$.global.culture.name]);
        });
    }
    $("input[type='datetime']").datepicker();
}

Then we set up all inputs with type=datetime. You could have used a CSS class if you like as well.

Now our jQuery DatePicker is French.

Right to Left (body=rtl)

For languages like Arabic and Hebrew that read Right To Left (RTL) you'll need to change the dir= attribute of the elements you want flipped. Most often you'll change the root element to <html dir="rtl"> or change it with CSS like:

div {
    direction: rtl;
}

The point is to have a general strategy, whether it be a custom layout file for RTL languages or just flipping your shared layout with either CSS or an HTML Helper. Often folks put the direction in the resources and pull out the value ltr or rtl depending.
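If you'd rather compute the direction than store it in resources, a small lookup does the job. This is only a sketch: `RTL_LANGUAGES` is a hypothetical, incomplete list and `directionFor` is a made-up helper name.

```javascript
// Hypothetical list of right-to-left language codes; extend as needed.
const RTL_LANGUAGES = ["ar", "he", "fa", "ur"];

// Map a culture name such as "ar-SA" to a value for the dir= attribute.
function directionFor(culture) {
  const language = culture.toLowerCase().split("-")[0];
  return RTL_LANGUAGES.includes(language) ? "rtl" : "ltr";
}

console.log(directionFor("ar-SA")); // rtl
console.log(directionFor("en-US")); // ltr

// In a browser you could then flip the whole page:
// document.documentElement.dir = directionFor(currentCulture);
```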

Conclusion

Globalization is hard and requires actual thought and analysis. The current JavaScript offerings are in flux.

A lot of this stuff could be made boilerplate or automatic, but much of it is a moving target. I'm currently exploring either a NuGet package that sets stuff up for you OR a "File | New Project" template with all the best practices already set up and packaged into one super-package. What's your preference, Dear Reader?

The Complete Script

Here's my current "complete" working script that could then be moved into its own file. This is a work in progress, to be sure. Please forgive any obvious mistakes as I'm still learning JavaScript.

    <script>
        $(document).ready(function () {
            //Ask ASP.NET what culture we prefer, because we stuck it in a meta tag
            var data = $("meta[name='accept-language']").attr("content")

            //Tell jQuery to figure it out also on the client side.
            $.global.preferCulture(data);

            //Tell the validator, for example,
            // that we want numbers parsed a certain way!
            $.validator.methods.number = function (value, element) {
                //Compare against NaN so that a valid 0 isn't rejected as falsy
                return this.optional(element) || !isNaN($.global.parseFloat(value));
            };

            //Fix the range to use globalized methods
            jQuery.extend(jQuery.validator.methods, {
                range: function (value, element, param) {
                    //Use the Globalization plugin to parse the value
                    var val = $.global.parseFloat(value);
                    return this.optional(element) || (val >= param[0] && val <= param[1]);
                }
            });

            //Setup datepickers if we don't support it natively!
            if (!Modernizr.inputtypes.date) {
                if ($.global.culture.name != 'en-us' && $.global.culture.name != 'en') {

                    var datepickerScriptFile = "/Scripts/globdatepicker/jquery.ui.datepicker-" + $.global.culture.name + ".js";
                    //Now, load the date picker support for this language
                    // and set the defaults for a localized calendar
                    $.getScript(datepickerScriptFile, function () {
                        $.datepicker.setDefaults($.datepicker.regional[$.global.culture.name]);
                    });
                }
                $("input[type='datetime']").datepicker();
            }

        });
    </script>


Hanselminutes Podcast 182 - The History and Future of Web Standards with Molly Holzschlag from molly.com

October 2, '09 Comments [12] Posted in ASP.NET | ASP.NET MVC | Internationalization | Open Source | Podcast

My one-hundred-and-eighty-second podcast is up. Scott's in Mexico this week and he's sitting down with Molly Holzschlag. Molly is a well-known Web standards advocate, instructor, and author and currently works for Opera as an evangelist. She explains the history of HTML, SGML and XML and we chat about where we think the web is headed.

Molly is on Twitter, and at http://www.molly.com.

Subscribe: Subscribe to Hanselminutes Subscribe to my Podcast in iTunes

Download: MP3 Full Show

Do also remember the complete archives are always up and they have PDF Transcripts, a little-known feature that shows up a few weeks after each show.

Telerik is our sponsor for this show.

Check out their UI Suite of controls for ASP.NET. It's very hardcore stuff. One of the things I appreciate about Telerik is their commitment to completeness. For example, they have a page about their Right-to-Left support while some vendors have zero support, or don't bother testing. They also are committed to XHTML compliance and publish their roadmap. It's nice when your controls vendor is very transparent.

As I've said before this show comes to you with the audio expertise and stewardship of Carl Franklin. The name comes from Travis Illig, but the goal of the show is simple. Avoid wasting the listener's time. (and make the commute less boring)

Enjoy. Who knows what'll happen in the next show?


Do you have to know English to be a Programmer?

November 20, '08 Comments [123] Posted in Internationalization

An interesting comment thread broke out in a recent post on Using Crowdsourcing for Expanding Localization of Products. Someone linked to a post and used the phrase:

"If you don't know English, you're not a programmer."

The post linked to didn't make the statement so boldly, but it's an interesting "link bait" phrase, isn't it? It's definitely phrased to get your attention and evoke opinions. I don't agree with it, but I wanted to dig more into the concept.

This whole conversation caught the eye of Fabrice Fonck, General Manager (GM) of Developer Content & Internationalization for DevDiv. He wrote this email to me and I wanted to share it with you. He was a programmer before he became a manager, and English is not his first language, so I thought it fitting. I also added emphasis in spots. Fabrice believes very strongly in the usefulness of translation and translated content and has an entire organization dedicated to it, so you can understand why he'd feel strongly about this.

I began studying computer science and programming in 1985 as a freshman in a business school in France, my native country. At the time, localized versions of programming tools were not available and I will always remember when I picked up that version of GW-Basic only to realize that it was all in English. Learning programming seemed already daunting, but doing it in a foreign language only increased my level of fear. Over 20 years have gone by and English does not feel quite as foreign to me anymore, but I cannot help but think that for billions of people around the world, taking on such a double challenge may not necessarily lead to the same outcome.

Over the past 17 years in the Developer Division at Microsoft, I have devoted a large portion of my time and energy making sure our products and technologies are available in as many languages as possible because I believe it is important to make them accessible to as many people as possible around the world. During all these years, I have had the privilege of traveling to many countries around the world and I have talked to many of our customers, a number of which through interpreters. I have met many brilliant developers out there whose English language skills were limited if not practically non-existent. This anecdotal evidence is supported by our sales figures. In Japan for instance, where we have one of our largest developer populations in the world, over 99% of our product sales are in Japanese. Entering that market with an English-only product is a recipe for failure. The same is true in countries such as France, Germany, Spain, Russia or China where our localized products represent over 80% of our sales. The list of countries goes on and on.

While it is true that a number of people overseas for whom English is not their native tongue will eventually learn and benefit from the vast amounts of technical content available in English, a greater number will not. That is why we continue to expand the number of languages in which Developer Division products and technologies are localized into. Cost is obviously an important factor here, especially for smaller geographies. That is why we continue to invest in technologies such as machine translation, translation wikis and CLIP, and concepts such as crowdsourcing and community engagement to drive down costs and make these languages a reality for the millions of developers out there (and aspiring developers) that do not speak English. By making our products available in all these languages, we also foster more community engagement in these languages, through blogs, forums, chat rooms, etc.

Here are some choice comments from the previous post:

Erling Paulsen: "Most articles, knowledge bases, books and so on are in English, so if you want to read up on something in depth, you need to have at least basic reading skills in English. Translating tooltips inside Visual Studio could end up causing confusion for at least new developers, as what they would see on-screen potentially did not match up with what the tutorial/book they were following." and "...I truly do appreciate that Microsoft is trying to make an effort, and I believe that MSDN has had a vast improvement in usability the past year or so. And the fact that MSFT are allowing community contribution is absolutely fantastic, but at least to me, the translation effort just seems a bit unnecessary." and "I never said, or meant to say that you need to be fluent in english to be a good programmer. And as Scott points out, the side-by-side translation feature would actually be a great way for learning english."

Paul van de Loo: "Developers might as well get used to learning new languages (even if they aren't programming languages)."

Spence: ""A programmer who doesn't at least understand English is not a programmer" that's an outrageous statement. That's like saying "a musician who is deaf is not a musician" patently untrue and ridiculous. plus pretty offensive to millions of programmers."

Ramiro: "I believe that in an ideal world every programmer should speak and read enough English to be able to work, learn and interact. However (and specially in Latin America) this is still a long term goal. I really applaud the effort being put in by Microsoft and other companies to make resources more available for everyone."

Robert Höglund: "I do think we developers need a common language. When you have a problem, get a strange exception, 9/10 just googling the error message will get you the answer. I have tried developing on a Swedish version of XP but trying to search for those error messages doesn't work. Can't say i agree with the statement "If you don't know English, you're not a programmer" but it does make life easier."

Farhaneh: "I can not speak and write english very well , but i'm taking classes and reading english books in my major to make it better. because i want to be a good programmer."

Filini: "The english syntax that has been used in programming languages for the last 50 years."

John Peek: "To say that if you don't know English, you're not a programmer is a perfect example of ethnocentrism in this country."

What do YOU think? Is learning English the #1 thing a Programmer should do (after learning to type)? Can you be an awesome programmer and speak little or NO English?

The comment that *I* personally agree with the most is from Ryan:

"It would *seem* (totally non-scientific sampling) that the non-english speakers (as a first language anyway) tend to agree with the statement "If you don't know English, you're not a programmer" more than native english speakers."

What do YOU think, Dear Reader?

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. I am a failed stand-up comic, a cornrower, and a book author.


Using Crowdsourcing for Expanding Localization of Products

November 13, '08 Comments [33] Posted in ASP.NET | Internationalization | MSDN | Programming | Tools | Windows Client

UPDATE: I wanted to add that these translation APIs are all part of Microsoft Translator services and are available for developers to use and build their own localized communities. The documentation is up on MSDN for AJAX/JSON, SOAP or POX (Plain old XML) APIs you can put in your apps today. Also, be sure to check out the Microsoft Translator Blog for more technical details on the V2 APIs and translator widget.

Not everyone in the world speaks English. Such a silly thing to say, but if you live in an English-speaking country it's easy to forget that many (most?) people in the world would prefer to do their work in the language of their choice.

Microsoft ships documentation in Visual Studio that is human-translated (a huge effort) into 9 major world languages. That's millions and millions of words * 9 languages. How can we cover more languages? How can we make documentation easier for folks who are trying to learn about our products and don't speak English fluently? How can we make English interfaces easier to use for non-English speakers who want to learn English?

Last month, I spoke to members of the internationalization/globalization team in DevDiv (Developer Division) about some of the little-known stuff they are doing. I think it deserves more attention, as there are some pretty innovative things being done. Some are experimental, but there's hope to expand them if they succeed.

MSDN uses Machine Translation and Crowdsourcing for Documentation

Doing a lot of work with a few people is hard. Doing a lot of work with a lot of people is confusing and expensive. However, doing a little bit of work with a LOT of interested people can be useful, cheap and fun if you "crowd-source" rather than outsource. Check out the screenshot below or visit the Brazilian MSDN site and check out the Translation Wiki v2.

[Screenshot: BrazilianMSDN]

You'll see there's the English MSDN documentation on the left, and Brazilian Portuguese on the right.

[Screenshot: LadoALado]

Make sure to select "side-by-side" or "Lado a Lado." If you hover over a sentence on the Portuguese side, a small Edit button will appear.


Click Edit, and you can suggest a better translation, and they'll go into a queue for community moderators to approve. Notice also that under "Other Suggestions" you'll see existing suggested translations that are in the queue for moderation.
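To make that workflow concrete, here's a minimal sketch in Python of a suggest-then-moderate queue for a single sentence. This is NOT how the Translation Wiki is actually implemented; the class names, method names and sample strings are all hypothetical, made up for illustration. It only shows the idea: the machine translation stays on screen until a moderator approves a community suggestion.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Suggestion:
    author: str
    text: str
    status: Status = Status.PENDING


@dataclass
class Sentence:
    source: str        # the English original
    translation: str   # what readers currently see (machine translated at first)
    suggestions: list = field(default_factory=list)

    def suggest(self, author, text):
        """A community member proposes a better translation; it is queued, not shown."""
        suggestion = Suggestion(author, text)
        self.suggestions.append(suggestion)
        return suggestion

    def pending(self):
        """The "Other Suggestions" still waiting for a moderator."""
        return [s for s in self.suggestions if s.status is Status.PENDING]

    def approve(self, suggestion):
        """A moderator accepts a suggestion, which replaces the displayed text."""
        suggestion.status = Status.APPROVED
        self.translation = suggestion.text

    def reject(self, suggestion):
        """A moderator declines a suggestion; the displayed text is unchanged."""
        suggestion.status = Status.REJECTED


# Hypothetical sample data, for illustration only.
sentence = Sentence(source="Start Debugging", translation="Começar depurando")
first = sentence.suggest("ana", "Iniciar Depuração")
second = sentence.suggest("bruno", "Iniciar a depuração")
sentence.approve(first)
print(sentence.translation)
print([s.author for s in sentence.pending()])
```

Keeping the suggestions in a list next to the sentence, rather than overwriting the text directly, is what lets other visitors see what's already been proposed before they add their own.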


The initial Portuguese text comes from the Machine Translation team. For some reason, Portuguese is the best language that the Machine Translation team understands.

The text on the site is roughly 80% MT (Machine Translated) and 20% human-edited via this technique, and growing. There's a goal to include more languages for the next version of Visual Studio, including possibly Arabic, Czech, Polish and Turkish, although things are still a little up in the air.

If you know a Brazilian developer, spread the word about this project and encourage them to make edits to the Brazilian MSDN site and check out the Translation Wiki v2.

Big thanks to our community partners: a group of 30 CS students, partly from the team of Prof. Hirata and Prof. Forster of Instituto Tecnologico de Aeronautica and the team of Prof. Simone Barbosa from Pontifícia Universidade Católica who post-edited 1.8 million words of MT'ed content; the Brazilian Terminologist who managed the glossary project with our MVPs; and finally the Academic Evangelist Team in DPE in Brazil who gave us their support throughout the project.

It'll be interesting to see how far this project goes and what other languages can benefit from it.

Captions Language Interface Pack (CLIP) - includes 9 more partial language translations for Visual Studio

Here's a description of the CLIP from its launch page:

"The Microsoft Captions Language Interface Pack (CLIP) is a simple language translation solution that uses tooltip captions to display results. Use CLIP as a language aid, to see translations in your own dialect, update results in your own native tongue or use it as a learning tool."

This is pretty clever. It's a background application that will show balloon tooltip help in your language while you work in the English version of Visual Studio. For example, in the screenshot below, I'm hovering my mouse over Start Debugging, and the Arabic CLIP pops up with a human translation of that menu item.

[Screenshot: clip]

It'll even help with other applications within Windows if it thinks it's got a decent translation, but for now, it is focused on correct translation for common Visual Studio options.

Even better, you can add translations of your own. In future versions, there's talk about setting up sharing (I figure you can hack it today, though, unsupported, by sharing the language database).
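If you want to picture how a tool like this might look up a caption, here's a rough Python sketch of a caption dictionary where your own translations take priority over the shipped ones. I'm not claiming this is how CLIP works internally; `CaptionDictionary`, its methods and the French sample strings are all assumptions for illustration.

```python
from typing import Optional


class CaptionDictionary:
    """Hypothetical caption lookup: user-added translations override built-in ones."""

    def __init__(self, built_in):
        # Normalize the shipped captions once so lookups are forgiving.
        self._built_in = {self._key(k): v for k, v in built_in.items()}
        self._user = {}

    @staticmethod
    def _key(caption):
        # Menu captions often carry accelerator markers ("&") and trailing dots.
        return caption.replace("&", "").rstrip(".").strip().lower()

    def add(self, caption, translation):
        """Store a translation supplied by the user."""
        self._user[self._key(caption)] = translation

    def lookup(self, caption) -> Optional[str]:
        """Return a translation, or None so the tooltip can simply stay hidden."""
        key = self._key(caption)
        if key in self._user:
            return self._user[key]
        return self._built_in.get(key)


# Hypothetical sample data, for illustration only.
captions = CaptionDictionary({"Start Debugging": "Démarrer le débogage"})
print(captions.lookup("&Start Debugging"))
print(captions.lookup("Build Solution"))
captions.add("Build Solution", "Générer la solution")
print(captions.lookup("Build Solution..."))
```

Returning `None` when there's no entry mirrors the idea of only popping up a tooltip when there's a decent translation to show, and keeping user entries in their own dictionary is the kind of thing that would make sharing them later straightforward.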


Visual Studio CLIP is available in these languages so far, all created with community and student help!

In addition to the CLIP, there's also the ability to do a Language Pack for the Visual Studio interface itself, as exemplified by the Brazilian Visual Studio Express Language Pack for SP1 that does about a 70% translation of VS into Portuguese. There's talk of doing more of these, too. That should make Carlos Quintero happy!

There are a lot of cool possibilities for all this technology, expanding MSDN and VS to as many languages as possible!

If you think this kind of thinking is pretty cool, leave a comment or blog about it and maybe we'll be heard by *ahem* the boss when he next (soon) reviews plans for this kind of community involvement. ;)


Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.