Encodings
Pulling back the veil
If you’ve written any python you might have seen the following:
with open(path, 'r', encoding='utf8') as f:
    content = f.readlines()
Or perhaps
with open(path, 'rb') as f:
    content = f.readlines()
Until recently, I only had the vaguest idea of:
- What the encoding= really meant
- What it meant to read raw bytes
In this note, I explore this.
Raw bytes: where to find them
The key to understanding this encoding business was to go back to basics: computers only store ones and zeros. So when you read / write a file, you are reading / writing a sequence of ones and zeros (a single one / zero is a bit; a sequence of 8 bits is a byte).
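To make this concrete, here is a minimal python sketch (the file name is arbitrary): write the string foo, read the file back as raw bytes, and look at the numbers underneath.

```python
# Write the string "foo", then read the file back as raw bytes.
with open("myfile", "w", encoding="utf8") as f:
    f.write("foo")

with open("myfile", "rb") as f:
    raw = f.read()

# Each byte is just a number between 0 and 255.
print(list(raw))                        # [102, 111, 111]
print([format(b, "08b") for b in raw])  # ['01100110', '01101111', '01101111']
```

Those three numbers, written in binary, are exactly the bits we are about to hunt down with shell tools.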
You probably knew that already at a conceptual level. But then you do:
$ echo -n "foo" > myfile
$ cat myfile
foo
and you are left wondering: where are the ones and zeros in all this?
xxd
Let’s start with the output of cat first. Why do we see foo and not a sequence of bits? That’s
because the cat program takes those bits and maps them to characters. But then how can I bypass
that and look at the actual bits?!
You need a so-called “hex dumper”. The most popular options are xxd and hexdump. With
xxd you can view the raw bits as follows:
$ xxd -b myfile
00000000: 01100110 01101111 01101111 foo
Ok how do you read this output?
- The 00000000: part is the offset. You can ignore that for now.
- The three chunks of 8 bits that follow – 01100110 01101111 01101111 – are the actual content of the file, in binary!
- The foo at the end is the ascii representation of the bytes.
Ok, so raw bytes in the file are 01100110 01101111 01101111. It’s often more convenient (and
compact) to read bytes in hex notation. You can do that by omitting the -b option.
$ xxd myfile
00000000: 666f 6f foo
Nice, so the raw bytes in the file, expressed in hex, are 66 6f 6f (you can supply the -g1
option to xxd to have the bytes nicely separated). Try it on other files and see for yourself!
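You can reproduce this hex view directly in python with the hex method of bytes objects (the separator argument needs python 3.8+):

```python
# Encode the string "foo" to bytes, then look at the hex digits.
data = "foo".encode("utf8")  # b'foo'
print(data.hex())            # '666f6f'
print(data.hex(" "))         # '66 6f 6f' -- one byte per group, like xxd -g1
```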
echo -e
Ok we now know how to see the raw bytes. But how about writing those bytes? In a way we know
how to do that already – when we wrote the file echo -n foo > myfile, we did write bytes to
a file. What I mean though, is: how can I write any sequence of bytes I have in mind
to the file, without any intermediary. To be concrete, say I want to write the following sequence
of two bytes: 00111001 11110001. How can I do that?
You might think: well let’s just write it as we did before:
echo -n 0011100111110001 > myfile2
But if you xxd the content of your file, you’ll see that you haven’t written what you wanted:
xxd -b myfile2
00000000: 00110000 00110000 00110001 00110001 00110001 00110000  001110
00000006: 00110000 00110001 00110001 00110001 00110001 00110001  011111
0000000c: 00110000 00110000 00110000 00110001                    0001
We’ll discuss why later. So how do we write the ones and zeros we want!? Well, the easiest
way is to convert them to hex first. That is easy – 39 f1. Then, you can use the -e option
of echo followed by the bytes:
echo -n -e '\x39\xf1' > myfile3
The \x tells echo that what follows is a byte in hex. Don’t forget the single quotes: without
them, the shell strips the backslashes before echo sees them, and you won’t get the correct result.
If you hexdump the content of myfile3, you see:
xxd -b myfile3
00000000: 00111001 11110001 9.
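If you’d rather stay in python, binary write mode (more on it below) lets you write the same two bytes without going through echo:

```python
# Write the two bytes 00111001 11110001 (hex 39 f1) directly to a file.
payload = bytes([0b00111001, 0b11110001])  # == b'\x39\xf1'
with open("myfile3", "wb") as f:
    f.write(payload)

# Read them back to check nothing was altered along the way.
with open("myfile3", "rb") as f:
    assert f.read() == b"\x39\xf1"
```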
Now let’s go back to the earlier attempt with echo -n 0011100111110001 ; why did it not work?
That is because bash interprets 0011100111110001 as a sequence of ascii characters where the
first character is ‘0’, the second is ‘0’, etc.. Now the ascii character ‘0’ is encoded (ah!
we’re talking about encodings now!) as 00110000, while the ascii character ‘1’ is encoded
as 00110001. So e.g. if you echo -n 01, you will write the byte 00110000 followed by the byte 00110001.
This is easy to verify:
echo -n 01 | xxd -b
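The same check works from python, using ord to get the ascii code of a character:

```python
# '0' has ascii code 48, i.e. 00110000; '1' has code 49, i.e. 00110001.
for ch in "01":
    print(ch, ord(ch), format(ord(ch), "08b"))
```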
Ok, we have everything we need to actually start talking about encodings now!
Encodings, finally
Let’s write a random sequence of bytes to have some meat to chew on:
echo -n -e '\xf1\xb8\x2a\x52' > myfile4
Ok now if you try to cat the content of this file, you will see some gibberish, and then
the character ‘*’ and then ‘R’. I’ve written above that cat does some decoding, mapping
bytes to characters: the question is, how does it do it? How does the byte
sequence f1 b8 2a 52 get decoded into something?
We finally get to the heart of encoding and decoding:
- What f1 b8 2a 52 gets decoded to is.. arbitrary; by which I mean there is nothing in those bytes that says they should be decoded in a specific way.
- People agree on certain “encodings” – e.g. ascii, utf-8, etc. – that specify how bytes are mapped to characters / glyphs, and vice-versa.
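A quick way to see this arbitrariness from python is to decode the same bytes with two different encodings. latin-1 assigns a character to every one of the 256 byte values, so it never fails; utf-8 has stricter structural rules, which this particular sequence violates.

```python
# The same four bytes as in the shell example above.
data = bytes.fromhex("f1 b8 2a 52")

# latin-1 maps every byte value to a character, so any sequence decodes:
print(data.decode("latin-1"))  # 'ñ¸*R'

# utf-8 rejects this sequence outright:
try:
    data.decode("utf-8")
except UnicodeDecodeError as e:
    print(e)
```

Same bytes, two different answers: the meaning lives in the chosen encoding, not in the bytes themselves.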
You can see how this can quickly become problematic. If I encode characters using a given encoding and send the file to other people, how do they know which encoding should be used to decode it? It’s a combination of two things, really. First, there is the strength of conventions: most text files you write will be encoded as either utf-8 or ascii. Second, a lot of work has gone into making some of these encodings compatible, so that, for instance, if your file has been encoded as ascii, it can be decoded using the utf-8 encoding without loss of content.
Binary vs text encodings
A number of programming languages – including e.g. python – have different “modes” for reading files: regular (or text) and binary. Why? What does it mean to have text and binary modes since all files are a bunch of bytes anyways?
When you open a file in python in normal mode, it tries to decode it using the encoding you provide (by default, utf-8). When you open in binary mode, it gives you the raw bytes, without trying to do any decoding.
This means that if you try to open e.g. myfile4 – which is some arbitrary sequence of bytes,
you’ll get an error:
In [1]: with open('myfile4', 'r') as f:
...: content = f.read()
...:
---------------------------------------------------------------------------
UnicodeDecodeError Traceback (most recent call last)
Cell In[1], line 2
1 with open('myfile4', 'r') as f:
----> 2 content = f.read()
File <frozen codecs>:322, in decode(self, input, final)
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
You can see that python is trying to decode the file using utf-8 encoding, and is encountering
a byte that it does not know how to decode, because it is not valid utf-8! But you can still
inspect the content of this file in python using the binary mode:
In [2]: with open('myfile4', 'rb') as f:
...: content = f.read()
...:
In [3]: content
Out[3]: b'\xf1\xb8*R'
Now you see that you don’t get a string but a bytes object:
In [5]: type(content)
Out[5]: bytes
The way python formats this object, it shows bytes that correspond to printable ascii characters as those characters, and everything else as \x escapes. If you want to see the hex representation of all the bytes (without that character substitution) then you can just call the hex method:
In [6]: content.hex()
Out[6]: 'f1b82a52'
or if you want a cleaner printout:
In [8]: content.hex(' ', 1)
Out[8]: 'f1 b8 2a 52'
Nice!
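Going the other way, bytes.fromhex rebuilds a bytes object from such a hex string (spaces are ignored), which closes the loop between the hex notation and the raw bytes:

```python
# Rebuild the bytes object from its hex representation.
data = bytes.fromhex("f1 b8 2a 52")
print(data)  # b'\xf1\xb8*R'

# Round-trip: back to the spaced hex string we started from.
assert data.hex(" ") == "f1 b8 2a 52"
```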
Conclusion
That concludes my own exploration into all this. At this point I don’t really care about the details of different encodings (i.e. how exactly utf-8 maps certain bytes to glyphs), though I’m sure this is super interesting.
Rather I wanted to explore the following:
- The idea that, at the end of the day, files are bytes. Therefore if you open a file or print its content and see text characters coming out, it means that some decoding happened along the way: some process took those bytes and decided that certain combinations of bytes would map to certain symbols. It also means that when you write strings into a file, at some point some process decides how they should be encoded into bytes – and that you have a choice in how this is done.
- The tools, functions, etc. one could use to get access to the actual bytes. They’re useful when you want to dig and explore these concepts.