What is an iolist? What is a string?

elixir erlang

Posted on: 2021-05-28

I was looking back at my first post on iolists in Elixir and Erlang and realized that I never exactly defined what an iolist is. I said:

An IO list just means "a list of things suitable for input/output operations", like strings or codepoints. Functions like IO.puts/1 and File.write/2 accept "IO data", which can be either a simple string or an IO list.

That's not terrible, but not very precise.

Also, Elixir has a bunch of string-like types, as James walked through in our ElixirConf 2016 talk "String Theory", and they're very confusing.

So to put it on the internet in textual form, here's a rundown of Elixir's string-like types, including iolists.

Bitstrings, binaries, and strings

A "bitstring" is anything between << and >> markers, and it contains a contiguous series of bits in memory. Eg: <<1::size(1), 0::size(1)>> is a bitstring containing two bits.
If there happen to be 8 of those bits, or 16, or any other number divisible by 8 - in other words, if it's a series of bytes (8 bits each) - we call that bitstring a "binary". (In my opinion, it should be called a "bytestring".)
- Eg <<0, 255>> == <<0::size(8), 255::size(8)>>
- Aside: Bitstrings and binaries could be used to represent any kind of data; for example, the raw data of an integer or an image.
If the bytes in a binary form a valid UTF-8 encoding of a sequence of codepoints, we call that binary a "string".
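Here's a small sketch of how you might check those three categories yourself, for example in IEx. It only uses the Kernel guards is_bitstring/1 and is_binary/1 plus String.valid?/1; the variable names are just for illustration.

```elixir
# Two bits: a bitstring, but not a binary (not a whole number of bytes).
two_bits = <<1::size(1), 0::size(1)>>
IO.inspect(is_bitstring(two_bits)) # true
IO.inspect(is_binary(two_bits))    # false
IO.inspect(bit_size(two_bits))     # 2

# Two whole bytes: a binary (and therefore also a bitstring),
# but not a string, because the byte 255 is never valid in UTF-8.
two_bytes = <<0, 255>>
IO.inspect(is_binary(two_bytes))     # true
IO.inspect(String.valid?(two_bytes)) # false

# Bytes that are valid UTF-8: a binary that is also a string.
hi = <<104, 105>>
IO.inspect(String.valid?(hi)) # true
IO.inspect(hi == "hi")        # true
```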

More examples:

"a" == <<97::size(8)>> because 97 is the codepoint for a, and if it's written using 8 bits, as ::size(8) specifies, it's 01100001, which is the proper UTF-8 encoding for codepoint 97 - in other words, the way to write "a" in UTF-8.
"a" != <<97::size(7)>>; in fact, <<97::size(7)>> is not a binary at all, since it's seven bits long (not a byte or multiple bytes)
<<255::size(8)>> is a binary but is not a string because its actual bits are 11111111, which is not a valid UTF-8 byte. Valid UTF-8 bytes must start with 0 (solo bytes), 10 (continuation bytes), or one of 110, 1110, or 11110 (first of N bytes, where N is the number of leading 1s). (This makes a lot more sense if you look at the chart at the bottom of my post on Unicode.)

iolists

An iolist is a list which many Erlang functions can use for io - input/output - like writing to a file or writing to a socket. (For example, sending an HTTP response involves writing data to a TCP socket.) But not every list is an iolist. A list of maps is not an iolist, for example.

An iolist is one of the following things, as defined in the Erlang reference manual page on typespecs:

A list of binaries:
- Example: ["cat"]
A list of integers between 0 and 255 (a single byte of data, written as an integer)
- Example: [99, 97, 116]
- Aside: Such a list is also a "charlist", but a charlist can contain larger numbers as well, as long as they are valid codepoints. Eg to_charlist("hełło") returns [104, 101, 322, 322, 111]. In a charlist context, we know that the larger numbers are codepoints, but in an io context, we'd want to know what specific bytes to write, which depends on whether we're writing string data or not, and whether we're using UTF-8 encoding or something else. Specifying the bytes might mean breaking those apart into individual UTF-8 bytes and expressing each of those bytes as an integer (strings are already encoded this way).
A list containing any mix of the first two things:
- Example: ["cat", [99, 97, 116]]
A list nested arbitrarily deeply containing any mix of the first three things:
- Example: ["cat", ["cat", [99, [97, [116]]]]]

You can call :erlang.iolist_to_binary/1 with any of the examples above, whereas if you call it with [%{}] you'll get an ArgumentError. (If you want to test whether a list is a charlist, you can call :io_lib.char_list/1.)
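To make that concrete, here's a rough sketch that runs the four example shapes through :erlang.iolist_to_binary/1, then shows what happens with a charlist that contains codepoints above 255. The variable names and the :argument_error atom are my own, purely for illustration.

```elixir
# Each of the four shapes is a valid iolist.
examples = [
  ["cat"],
  [99, 97, 116],
  ["cat", [99, 97, 116]],
  ["cat", ["cat", [99, [97, [116]]]]]
]

binaries = Enum.map(examples, &:erlang.iolist_to_binary/1)
IO.inspect(binaries) # ["cat", "cat", "catcat", "catcatcat"]

# A charlist may hold codepoints above 255, so it is not always an iolist.
charlist = to_charlist("hełło")
IO.inspect(:io_lib.char_list(charlist)) # true

charlist_result =
  try do
    :erlang.iolist_to_binary(charlist)
  rescue
    ArgumentError -> :argument_error
  end

IO.inspect(charlist_result) # :argument_error

# Spelling out the individual UTF-8 bytes as integers gives a valid iolist.
utf8_bytes = :binary.bin_to_list("hełło")
IO.inspect(utf8_bytes, charlists: :as_lists) # [104, 101, 197, 130, 197, 130, 111]
IO.inspect(:erlang.iolist_to_binary(utf8_bytes) == "hełło") # true
```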

Note: it's acceptable for iolists to be "improper lists", meaning "a list whose tail is not a list but something else". However, this is only acceptable if the tail is a binary.

Examples:

:erlang.iolist_to_binary([97 | "a"]) == "aa"
:erlang.iolist_to_binary(["a" | 97]) raises an ArgumentError

Aside: why nested and/or improper lists?

The ability to use nested lists means we can append to a list in O(1).

l = ["a"]    # => ["a"]
l = [l, "b"] # => [["a"], "b"]
l = [l, "c"] # => [[["a"], "b"], "c"]

There's no need to copy the whole list (which a real append would require, since lists are immutable), walk to the end of the copy, and add a pointer to the new item. Instead, we just allocate a new list that wraps the old list and the new item.

The ability to use improper lists means we can do the same trick but allocate fewer lists:

l = ["a"]     # => ["a"]
l = [l | "b"] # => [["a"] | "b"]
l = [l | "c"] # => [[["a"] | "b"] | "c"]
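As a quick check that both styles describe the same bytes, here's a sketch that flattens each one with IO.iodata_to_binary/1, then builds an improper iolist in a loop. The names nested, improper, parts, and built are just illustrative.

```elixir
nested = [[["a"], "b"], "c"]
improper = [[["a"] | "b"] | "c"]

# Both structures flatten to the same binary.
IO.inspect(IO.iodata_to_binary(nested))   # "abc"
IO.inspect(IO.iodata_to_binary(improper)) # "abc"

# The length can be computed without building the binary first.
IO.inspect(IO.iodata_length(improper)) # 3

# The same trick in a loop: each step allocates a single new cell
# whose head is everything so far and whose tail is the new binary.
parts = ["a", "b", "c", "d"]
built = Enum.reduce(parts, [], fn part, acc -> [acc | part] end)
IO.inspect(IO.iodata_to_binary(built)) # "abcd"
```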

iodata and chardata

Again, from the Erlang manual, the definition of iodata is simple, given the (complicated) definitions above: iodata is either an iolist or a binary.

A definition of chardata is harder to find, but the way James defined it in our ElixirConf talk "String Theory" was:

A proper or improper list of UTF-8 codepoints, strings, and/or nested chardata lists or a string

The Elixir docs describe them as:

iodata is "a list of integers representing bytes or binaries"
chardata is "a list of characters or strings"
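One way to see the difference in practice is to compare IO.iodata_to_binary/1 with IO.chardata_to_string/1. This is only a sketch with made-up variable names: in iodata, integers are treated as raw bytes, while in chardata they are treated as codepoints and get UTF-8 encoded.

```elixir
# iodata: a bare binary counts, and so does an iolist.
IO.inspect(IO.iodata_to_binary("cat"))             # "cat"
IO.inspect(IO.iodata_to_binary(["c", 97, [116]]))  # "cat"

# chardata: integers are codepoints, so 322 (ł) is allowed
# and is encoded as two UTF-8 bytes in the resulting string.
chardata = ["he", [322, 322], ?o]
string = IO.chardata_to_string(chardata)
IO.inspect(string)            # "hełło"
IO.inspect(byte_size(string)) # 7
```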

Conclusion

I just finished typing this and I already find it hard to keep these straight again. 😅 So don't worry if you do, too.