What is... UTF-8?

I’m often asked by clients to explain what certain technical terms mean, so I thought I’d write them up here as a bit of a reference.

So what the hell is UTF-8?

UTF-8 (Unicode Transformation Format, 8-bit) is, put simply, a character encoding. It takes a text character and says how it should be represented in binary.

Computers think in binary: 1s and 0s, known as bits. If you string 8 of those bits together you get what’s known as a byte, which can be arranged in 256 different ways (2 multiplied by itself 8 times).
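If you'd like to check the counting yourself, here is a small Python sketch. The variable names are my own and purely illustrative; it just works out how many patterns 8 bits can make and shows the smallest and largest of them.

```python
# Each extra bit doubles the number of possible patterns,
# so 8 bits give 2 * 2 * 2 * 2 * 2 * 2 * 2 * 2 patterns.
BITS_PER_BYTE = 8
patterns = 2 ** BITS_PER_BYTE

# The smallest and largest values a byte can hold, written as 8 binary digits.
smallest = format(0, "08b")
largest = format(patterns - 1, "08b")

print(patterns)  # 256
print(smallest)  # 00000000
print(largest)   # 11111111
```

The largest value is 255, but counting from 0 gives 256 different patterns in total.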

One of the first widespread encodings was ASCII (American Standard Code for Information Interchange). It uses 7 bits to represent the digits 0–9, lowercase letters a–z and uppercase A–Z. Add in commas, full stops, tabs, spaces, etc. and you get a total of 128 different things. Having grown out of telegraph codes, it also includes telegraph-specific control characters, most of which are no longer used.
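As a rough illustration, Python's built-in ord() function reveals the number behind a character. The sample text and variable names below are just made up for the example.

```python
# ord() gives the number a character is stored as.
code_for_A = ord("A")

# Written out as 7 binary digits, the way ASCII defines it.
seven_bits = format(code_for_A, "07b")

# Every ASCII character has a number below 128, so it fits in 7 bits.
sample = "Hello, World 123!"
all_fit = all(ord(ch) < 128 for ch in sample)

print(code_for_A)  # 65
print(seven_bits)  # 1000001
print(all_fit)     # True
```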

This works well for the English language, as it has all the characters covered, but not for other languages. They require accents and other special characters, which unfortunately won’t fit in ASCII’s 7-bit space. So other encodings were created for different languages. This, however, becomes a bit of a pain, as you have to know which encoding was used, and have it supported on your computer, to decipher the bits back into text.

UTF-8 replaces that patchwork of encodings with one universal encoding. Unlike the more restricted types, UTF-8 uses anywhere from 1 to 4 bytes per character.

One byte covers the 128 characters of ASCII, encoded identically, which means plain ASCII text is already valid UTF-8 and reads perfectly fine.
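A quick way to see this compatibility for yourself, sketched in Python with an example sentence of my own choosing:

```python
text = "Hi, how are you?"

# Encode the same text both ways.
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For plain English text the two results are byte-for-byte identical.
same = ascii_bytes == utf8_bytes

# So bytes written as ASCII can be read back as UTF-8 without any conversion.
decoded = ascii_bytes.decode("utf-8")

print(same)             # True
print(decoded)          # Hi, how are you?
print(len(utf8_bytes))  # 16
```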

Two bytes cover almost all European Latin-based alphabets, as well as Cyrillic, Greek, Hebrew, Arabic and other Middle Eastern alphabets.

Three bytes cover most of the remaining scripts in everyday use, including the vast majority of Chinese, Japanese and Korean characters.

And four bytes cover historic scripts, specialist maths symbols and most emojis!
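To see the 1 to 4 byte range in action, the sketch below encodes a handful of characters and counts the bytes each one needs. The characters are just ones I picked to represent each group.

```python
# One sample character from each group described above.
samples = ["A", "é", "я", "中", "한", "😀"]

# Encode each character as UTF-8 and count how many bytes it takes.
byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, count in byte_counts.items():
    print(ch, count)
# A 1
# é 2
# я 2
# 中 3
# 한 3
# 😀 4
```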

So next time you’re texting and someone’s typing an essay, think about how many bits will be sent to you in the blink of an eye.

We see: Hi, how are you?

The computer sees: 01001000011010010010110000100000011010000110111101110111001000000110000101110010011001010010000001111001011011110111010100111111
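If you want to produce that string of 1s and 0s yourself, here is one way it could be done in Python. It's a sketch with variable names of my own, not the only way to do it.

```python
message = "Hi, how are you?"

# Turn the text into UTF-8 bytes, then write each byte as 8 binary digits.
encoded = message.encode("utf-8")
bits = "".join(format(byte, "08b") for byte in encoded)

print(len(encoded))  # 16
print(len(bits))     # 128
print(bits[:8])      # 01001000
print(bits)
```

Sixteen characters become 16 bytes, or 128 bits, and the first 8 of them (01001000) are the letter H.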


I hope this has explained UTF-8 even to those who don’t consider themselves "techie".

This post is part of a series explaining technical subjects in simple terms, inspired by a quote often attributed to physicist Richard Feynman: "If you can’t explain something in simple terms, you don’t understand it." Stay tuned.