Mathieu Larose

Unicode and UTF-8

December 2020

Confused about Unicode and UTF-8? Don't worry, you are not alone. Unicode and UTF-8 are probably one of the most commonly confused pairs of terms in software development, even among seasoned developers.

The objective of this post is to demystify Unicode and UTF-8 so you become comfortable working with them. We will do so by looking at a concise explanation of Unicode and UTF-8, and at a concrete example in Python.

Unicode vs UTF-8

Unicode assigns numbers to characters. For example, the capital "A" is 65 and the lowercase "a" is 97. But Unicode is not limited to English characters. Its goal is to assign a number to every character used by humans. This includes letters (from any language), punctuation marks, symbols, emojis, etc. As of March 2020 (Unicode 13.0), Unicode has 143,859 characters. So the semicolon ";" is in there, just like the lowercase e with acute "é" and the thumbs up emoji "👍".
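To make the idea of "a number per character" concrete, here is a small Python sketch that looks up the numbers of the characters mentioned above. It only uses the built-in `ord()` and `chr()` functions; the variable names are just for illustration.

```python
# Look up the Unicode number (code point) of a few characters with ord().
characters = [
    "A",
    "a",
    ";",
    "\u00e9",  # é, lowercase e with acute
    "👍",      # thumbs up emoji
]
code_points = {ch: ord(ch) for ch in characters}

for ch, number in code_points.items():
    # Code points are conventionally written in hexadecimal as U+XXXX.
    print(f"{ch!r} -> {number} (U+{number:04X})")

# chr() goes the other way: from a number back to its character.
capital_a = chr(65)
```

Note that nothing here involves bytes yet: these are just numbers, independent of how they end up being stored.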

But characters don't live in isolation. They are combined into sequences of characters to create sentences, paragraphs or source code. And, as with any data, we need a format to be able to store them in memory or on disk, and to read them back. This is what UTF-8 does.

UTF-8 is a format that specifies how to transform (encode) a sequence of Unicode characters into a sequence of bytes, and, vice versa, how to transform (decode) a sequence of bytes into a sequence of Unicode characters. We need such a format because bytes are the only way to store information on a computer.

UTF-8 is not the only format that can encode and decode Unicode characters to and from bytes. You may have heard of UTF-16 or UCS-2. But UTF-8 is the most popular one.
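As a rough illustration of "same characters, different bytes", the sketch below encodes one short string with three of Python's built-in codecs. The sample string and variable names are my own choices, and the little-endian codec variants are picked only to keep byte order marks out of the output.

```python
text = "A\u00e9👍"  # "A", "é" (U+00E9), and the thumbs up emoji

# Explicit little-endian variants are used so that Python does not
# prepend a byte order mark (BOM) to the encoded bytes.
encodings = ["utf-8", "utf-16-le", "utf-32-le"]

# Same three characters, three different byte sequences.
encoded = {name: text.encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"{name:10} {len(data):2} bytes  {data.hex(' ')}")

# Decoding with the matching format gives the original string back.
round_trips = {name: data.decode(name) for name, data in encoded.items()}
```

UTF-8 uses between one and four bytes per character here, while the other two formats use fixed two- or four-byte units. The characters themselves are identical in all three cases.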

The Unicode standard

Before I receive a complaint from the Unicode Technical Committee, I'd like to clarify something. The definition I gave above for Unicode is the everyday one. When we refer to Unicode, we usually mean the assignment of numbers to characters, which is called the (Unicode) code charts. But that's just one part of the Unicode standard.

In fact, UTF-8, which stands for Unicode Transformation Format 8-bit, is another part of the Unicode standard. It's one of its encoding formats.

Unicode and UTF-8 in Python

In Python, a sequence of Unicode characters is represented as a string:

>>> s = "Hello 👋"
>>> s
'Hello 👋'
>>> type(s)
<class 'str'>

A string can be encoded to UTF-8 as an array of bytes:

>>> b = s.encode("utf-8")
>>> b
b'Hello \xf0\x9f\x91\x8b'
>>> type(b)
<class 'bytes'>

And an array of bytes can be decoded to a string:

>>> s2 = b.decode("utf-8")
>>> s2
'Hello 👋'
>>> type(s2)
<class 'str'>
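One practical consequence, sketched below with the same string: a `str` is measured in characters and a `bytes` object in bytes, so the two lengths differ as soon as a character needs more than one byte. The truncation step is a contrived example of mine, meant only to show that not every byte sequence is valid UTF-8.

```python
s = "Hello 👋"
b = s.encode("utf-8")

# len() counts characters (code points) for str and bytes for bytes.
char_count = len(s)  # 6 ASCII characters + 1 emoji
byte_count = len(b)  # 6 single-byte characters + 4 bytes for the emoji

# Dropping the last byte cuts the emoji's 4-byte sequence short,
# so the remaining bytes can no longer be decoded as UTF-8.
truncated = b[:-1]
try:
    truncated.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as error:
    decode_failed = True
    print(f"cannot decode: {error}")
```

This is why slicing or splitting should generally be done on the decoded string rather than on the raw bytes.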

Further reading

If you want to know more about Unicode and UTF-8, I recommend the Unicode and UTF-8 pages on Wikipedia.

You can also look up any Unicode character at