2.4.1 Input encodings

The input to TeX (or any computer program) ultimately consists of a sequence of bytes. (Nowadays, a byte is almost universally an eight-bit number, i.e., an integer between 0 and 255, inclusive.) The input encoding defines how to interpret that sequence of bytes, and thus how LaTeX behaves.

Today, by far the most common way to encode text is UTF-8, a so-called “Unicode Transformation Format” that specifies how Unicode code points are represented as sequences of 8-bit bytes, and thus how such a byte sequence is decoded back into code points. The code points themselves are defined independently of any particular representation. The Unicode standard defines code points for virtually all characters used today in written text.

When TeX was created, Unicode and UTF-8 did not exist and the 7-bit ASCII encoding was by far the most widely used. So TeX does not require Unicode for text input. UTF-8 is a superset of ASCII, so a pure 7-bit ASCII document is also UTF-8.
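
As a small illustration, the following source uses only 7-bit ASCII characters, so the same bytes are valid as both ASCII and UTF-8. The accented letters are produced with LaTeX's accent commands rather than typed directly. The sample words are arbitrary and chosen only for this sketch.

```latex
\documentclass{article}
\begin{document}
% Every byte in this file is below 128, so the file is
% both a valid ASCII document and a valid UTF-8 document.
Caf\'e, na\"ive, stra\ss{}e.
\end{document}
```

Because no byte has its high bit set, this file should be read the same way whichever of the two encodings is assumed.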

Since 2018, the default input encoding for LaTeX is UTF-8. Methods for handling documents written in another encoding, such as ISO-8859-1 (Latin 1), are explained in the section “inputenc package”.
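
A minimal sketch of the non-UTF-8 case, assuming the source file is actually saved in ISO-8859-1 by your editor. The option name latin1 is the one the inputenc package uses for that encoding; the sample word is arbitrary.

```latex
% Assumption: this file is saved as ISO-8859-1 (Latin 1), not UTF-8.
\documentclass{article}
\usepackage[latin1]{inputenc} % declare the encoding of this file
\begin{document}
% In Latin 1 the e-acute below is the single byte 0xE9;
% in UTF-8 it would be the two bytes 0xC3 0xA9.
café
\end{document}
```

If the same file were saved as UTF-8 instead, the \usepackage line could simply be dropped in a LaTeX release from 2018 or later, since UTF-8 is then the default.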

You can easily find more about all these topics in any introductory computer text or online. For example, you might start at: https://en.wikipedia.org/wiki/Unicode.


Unofficial LaTeX2e reference manual