Scott Hanselman

Computer things they didn't teach you in school #2 - Code Pages, Character Encoding, Unicode, UTF-8 and the BOM

November 15, '19 Comments [16] Posted in Musings

OK, fine maybe they DID teach you this in class. But, you'd be surprised how many people think they know something but don't know the background or the etymology of a term. I find these things fascinating. In a world of bootcamp graduates, community college attendees (myself included!), and self-taught learners, I think it's fun to explore topics like the ones I plan to cover in my new YouTube Series "Computer things they didn't teach you."

BOOK RECOMMENDATION: I think of this series as being in the same vein as the wonderful "Imposter's Handbook" series from Rob Conery (I was also involved, somewhat). In Rob's excellent words: "Learn core CS concepts that are part of every CS degree by reading a book meant for humans. You already know how to code build things, but when it comes to conversations about Big-O notation, database normalization and binary tree traversal you grow silent. That used to happen to me and I decided to change it because I hated being left out. I studied for 3 years and wrote everything down and the result is this book."

In the first video I covered the concept of Carriage Returns and Line Feeds. But do you know WHY it's called a Carriage Return? What's a carriage? Where did it go? Where is it returning from? Who is feeding it lines?

In this second video I talk about Code Pages, Character Encoding, Unicode, UTF-8 and the BOM. I thought it went very well.

What would you like to hear about next?



About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

Wednesday, November 20, 2019 3:41:05 AM UTC
Oh boy, did this ever bring back some interesting old memories.

I was tasked to work on a text processing library. This library had to be able to take any random text file from anywhere on the internet, detect which encoding it used, convert it to Unicode, and normalize it (i.e. combine modifier characters where possible and put the rest in a known order). After all of this, the text could then be processed.

I had some knowledge of character encodings before this, but working on this library really opened my eyes. I didn't realise quite how many encodings there were out there, and how widely varied they can be. There's single byte encodings, multi-byte encodings, variable byte length encodings, surrogate characters... Never mind the encodings that are almost identical except for a couple of values (e.g. Windows code page 1252 vs ISO-8859-1).
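To make the "almost identical" point concrete, here is a minimal sketch using Python's built-in codecs. It assumes only the standard library; byte 0x80 is picked as one illustrative position where Windows-1252 and ISO-8859-1 disagree.

```python
# Byte 0x80 is one of the positions where the two encodings disagree.
raw = bytes([0x80])

as_cp1252 = raw.decode("cp1252")    # Windows-1252 maps 0x80 to the euro sign
as_latin1 = raw.decode("latin-1")   # ISO-8859-1 maps 0x80 to the control code U+0080

print(repr(as_cp1252), repr(as_latin1))

# Bytes in the range 0xA0-0xFF decode to the same characters in both.
shared = bytes(range(0xA0, 0x100))
print(shared.decode("cp1252") == shared.decode("latin-1"))
```

The differences sit in the 0x80-0x9F range, which is why text that decodes cleanly under one of the two can still show a few wrong characters under the other.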

I'm quite happy that the world is moving towards Unicode (UTF-8 and UTF-16) and getting away from the myriad of crazy encodings out there.
Jamie Anderson
Wednesday, November 20, 2019 5:45:54 AM UTC
I really like this new series.
Kolappan
Wednesday, November 20, 2019 6:20:38 AM UTC
In the last example, the lower bytes also get translated to their meanings, like BEL, ACK, NAK and ESC.
Franc
Wednesday, November 20, 2019 6:57:51 AM UTC
This new video tutorial series is very nicely presented and the topics are very interesting. Keep doing them.
Razvan
Wednesday, November 20, 2019 11:31:22 AM UTC
BOM is one of those pitfalls every junior dev I work with does not know how to handle. We get the occasional xml file from clients that contains a BOM.
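One way to cope with a BOM on an incoming XML file, sketched in Python. The payload below is a made-up example, not a real client file, and it assumes the data is UTF-8. The "utf-8-sig" codec is part of Python's standard library.

```python
import codecs
import xml.etree.ElementTree as ET

# Hypothetical payload: a UTF-8 BOM followed by a small XML document.
payload = codecs.BOM_UTF8 + b"<order><id>42</id></order>"

# "utf-8-sig" decodes UTF-8 and drops a leading BOM if one is present.
text = payload.decode("utf-8-sig")

# Plain "utf-8" keeps the BOM as a leading U+FEFF character.
plain = payload.decode("utf-8")

root = ET.fromstring(text)
order_id = root.findtext("id")
print(repr(plain[0]), order_id)
```

Decoding with "utf-8-sig" is harmless when there is no BOM, so it can be a reasonable default for files of uncertain origin that are known to be UTF-8.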

Same goes for encoding issues. Depending on how people connect to their servers (putty with wrong encoding) and how server-side editors are configured (vi with wrong encoding) you might have to handle multiple layers of bad encoding.

These are the real trenches I like to throw junior devs into.
Wednesday, November 20, 2019 12:30:57 PM UTC
Maybe a nice follow-up to this would be how this might apply to a real world scenario. As an example, how best to handle text which is copied from some unknown source and pasted into a web form and in the end stored in a SQL Database?

Another great video! Thank you!!
Johnny Bouder
Wednesday, November 20, 2019 4:18:30 PM UTC
Thanks for this Scott, loving the series so far. I know a little bit about Encoding (mainly that it exists and there are many different types etc). However, the idea of a BOM is new to me and although it makes sense what I'm wondering is how do you know if the first few bytes of a file are a BOM or just part of the content?

Is the BOM always a specific length, does it always start with a certain piece of data?

Thanks,

Matt
Matt Jellings
Wednesday, November 20, 2019 6:28:21 PM UTC
Simply explains why C++ fell out of favor with the single-byte to multibyte character mess in 2001.
C++
Thursday, November 21, 2019 7:12:30 AM UTC
Hi,
This reminds me of an ancient (but awesome) post from Joel Spolsky which explains it a bit better for me.

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
Erik
Monday, November 25, 2019 10:45:41 AM UTC
However, the idea of a BOM is new to me and although it makes sense what I'm wondering is how do you know if the first few bytes of a file are a BOM or just part of the content?

Is the BOM always a specific length, does it always start with a certain piece of data?


The BOM is part of the content, and is always at the very start of the content. You only have to read the first few bytes to know if the BOM is present, and what encoding it indicates.

Scott mentioned UTF-8 encoding, but Unicode also defines some more standard encodings. One is UTF-16, which typically uses 2 bytes per code point but will sometimes use 4 bytes. Another is UTF-32, which always uses 4 bytes per code point. The order of these bytes can change depending on the CPU, which can be big-endian (BE, rare) or little-endian (LE, common, used by x86/x64).
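The byte counts and byte order described above can be checked with a short Python sketch. The helper name encoded_sizes and the three sample characters are illustrative choices, not anything from the video.

```python
def encoded_sizes(ch: str) -> dict:
    """Byte length of one code point under three Unicode encodings."""
    return {codec: len(ch.encode(codec))
            for codec in ("utf-8", "utf-16-le", "utf-32-le")}

for label, ch in [("A", "A"), ("euro sign", "\u20ac"), ("emoji", "\U0001F600")]:
    print(label, encoded_sizes(ch))

# Same code point, same encoding family, different byte order.
be = "A".encode("utf-16-be")
le = "A".encode("utf-16-le")
print(be.hex(), le.hex())
```

The emoji lies outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair (4 bytes) for it, while UTF-32 stays at 4 bytes for every sample.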

Note that I'm saying "code point" rather than "character". This is because each Unicode value isn't necessarily a single character. Some values are modifiers, such as the accent mark above "a" in the letter "á". Some values aren't shown at all but have a special meaning, such as the zero width space and zero width joiner. "Code point" is the correct name that covers all of these values.
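A small Python sketch of the difference, using the "á" example from above. It relies only on the standard unicodedata module; the variable names are illustrative.

```python
import unicodedata

# "a" followed by a combining acute accent: two code points, one visible character.
decomposed = "a\u0301"

# NFC normalization combines them into the single code point U+00E1 where possible.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))
print([unicodedata.name(cp) for cp in decomposed])
```

Both strings usually render the same on screen, which is why comparing text without normalizing first can give surprising results.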

Anyway, back to the BOM. The value of the BOM is always the encoded representation of the code point U+FEFF. It comes out to the following values under the various Unicode encodings:


  • UTF-8 = EF BB BF
  • UTF-16-BE = FE FF
  • UTF-16-LE = FF FE
  • UTF-32-BE = 00 00 FE FF
  • UTF-32-LE = FF FE 00 00


This makes it really easy to identify one of the Unicode character encodings. You only need to read the first 4 bytes of the file, and if you find one of the patterns above, you know which encoding to use.
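Here is a rough sketch of that check in Python. The function name sniff_bom is made up for illustration; the BOM constants come from the standard codecs module. One detail worth noting is that the longer patterns have to be tested first, because the UTF-32-LE BOM (FF FE 00 00) starts with the UTF-16-LE BOM (FF FE).

```python
import codecs

# Longer signatures first: the UTF-32-LE BOM begins with the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(head: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is found."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return None, 0

print(sniff_bom(b"\xef\xbb\xbfhello"))
print(sniff_bom(b"\xff\xfeh\x00i\x00"))
print(sniff_bom(b"hello"))
```

This is only a heuristic. A file without a BOM returns (None, 0) and still needs some other way to determine its encoding, and a UTF-16-LE file whose first character after the BOM is U+0000 would be mistaken for UTF-32-LE.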

It's a lot better than the pre-Unicode days. There was no standard way to indicate which encoding was used, so you had to guess. It was very easy to guess the wrong encoding and you'd end up with gibberish.
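The kind of gibberish a wrong guess produces (often called mojibake) is easy to reproduce. This sketch assumes the text was written as UTF-8 and then read as Windows-1252, which is just one common combination.

```python
original = "café"

# Written as UTF-8, then decoded with the wrong encoding.
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# If the wrong guess is known, the damage can sometimes be reversed.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == original)
```

The round trip only works when every garbled character can be mapped back to its original byte, so it should be treated as a best-effort repair, not a guarantee.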
Jamie Anderson
Monday, November 25, 2019 10:32:11 PM UTC
There is a curated list of information that Every Programmer Should Know that may be useful for additional sessions.
David Beckman
Tuesday, November 26, 2019 11:04:42 AM UTC
How can you talk about this and not say Mojibake!
Martijn
Tuesday, November 26, 2019 8:39:15 PM UTC
Refer: the Unix file(1) command from 1973 to see the evolution and difficulty of determining content by file name, extension, and/or leading bytes.

https://en.wikipedia.org/wiki/File_(command)
Rob
Wednesday, November 27, 2019 3:41:11 PM UTC
One standard I have to deal with is converting from EBCDIC to ASCII. This happens when dealing with sending and receiving files from the mainframe. I enjoyed this tutorial on the different encoding formats.
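As a small illustration, Python ships codecs for several EBCDIC code pages. The sketch below assumes code page 037 (cp037); mainframes use different EBCDIC variants, so the right code page depends on the system involved.

```python
text = "HELLO"

# Encode to EBCDIC, assuming code page 037.
ebcdic = text.encode("cp037")
print(ebcdic.hex())

# Convert back to ASCII by decoding from EBCDIC and re-encoding.
ascii_bytes = ebcdic.decode("cp037").encode("ascii")
print(ascii_bytes)
```

Note that none of the EBCDIC byte values match their ASCII counterparts, so a file transferred without conversion is unreadable on the other side.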

Also seems weird that when you write out binary you have to write it out backwards to get it to be read correctly forwards. Reading and writing binary might be a good topic to cover in a future video. Keep these videos coming.
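The "backwards" effect is most likely byte order (endianness). A minimal Python sketch, assuming a 32-bit unsigned value chosen purely for illustration:

```python
value = 0x12345678

little = value.to_bytes(4, "little")   # least significant byte first
big = value.to_bytes(4, "big")         # most significant byte first
print(little.hex(), big.hex())

# Reading back with the matching byte order recovers the value.
print(int.from_bytes(little, "little") == value)

# Reading with the wrong byte order gives a different number.
print(hex(int.from_bytes(little, "big")))
```

The bytes look reversed only when a little-endian value is viewed in the order it is stored; reading it with the same byte order it was written in always gives the original number.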
Jason
Comments are closed.

Disclaimer: The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.