What Is Character Encoding?

Eugene P.

Last Modified Date: February 17, 2024

Character encoding, in computer programming, is a method or algorithm for mapping a character, glyph or symbol to a representation that is usually numerical. Computers need character encoding because information in memory and on computer-readable media is stored as sequences of bits or numbers, so the non-numerical characters used for display or human-readable output must be translated into a form that a computer can manipulate. In a more specific application, HyperText Markup Language (HTML) documents can declare which character encoding they use, so that the web browser knows which character set to apply when displaying the document. Several encoding schemes are in use, though many proprietary and legacy sets are slowly being replaced by the Unicode® encoding standard.
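As a minimal sketch of the mapping between characters and numbers, the Python snippet below converts a short string to numeric values and back. The helper names `to_code_points` and `from_code_points` are hypothetical and chosen only for illustration; `ord`, `chr` and `str.encode` are standard Python built-ins.

```python
def to_code_points(text: str) -> list:
    """Map each character to the number Python uses to identify it."""
    return [ord(ch) for ch in text]


def from_code_points(points: list) -> str:
    """Rebuild the text from its list of numbers."""
    return "".join(chr(p) for p in points)


sample = "Hi!"
points = to_code_points(sample)
print(points)                    # [72, 105, 33]
print(from_code_points(points))  # Hi!

# The same text as raw bytes under one particular encoding (UTF-8 here).
print(sample.encode("utf-8"))    # b'Hi!'
```

The round trip only works because both directions agree on the same mapping, which is the role a character encoding plays between programs.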

In the early days of computers, when memory was limited, the basic characters of the English alphabet, along with punctuation and digits, were stored in 7-bit sequences, allowing for 128 different characters. In this original scheme, each 7-bit code represented one character, and the codes were numbered in sequence. This character encoding was efficient; it was eventually standardized as ASCII and used in most of the computers that were produced. Although the system was later extended into the Unicode® encoding standard, the concept remained the same: each character is assigned a single number within a large standard character set, and that number is what a computer uses to store, process and index the character.
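The 7-bit scheme described above can be illustrated with a short Python sketch. The function `ascii_bits` is a hypothetical helper written for this example, not part of any standard; it simply shows that every character in the 128-character range fits in seven binary digits and that letters are numbered in sequence.

```python
def ascii_bits(ch: str) -> str:
    """Return the 7-bit binary pattern for a single ASCII character."""
    code = ord(ch)
    if code > 0x7F:
        # Anything above 127 does not fit in the original 7-bit scheme.
        raise ValueError(f"{ch!r} is outside the 128-character ASCII range")
    return format(code, "07b")


print(ascii_bits("A"))      # 1000001
print(ascii_bits("0"))      # 0110000
print(ord("B") - ord("A"))  # 1 (letters are numbered in sequence)

# Unicode keeps the one-number-per-character idea with a much larger range.
print(ord("€"))             # 8364
```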

Other types of character encoding were developed for different reasons. Some schemes that were geared specifically to the English alphabet and intended for text only packed their 7-bit character codes back to back across 8-bit bytes, or octets, instead of storing one character per octet. This saved 1 bit per character, so eight characters fit in seven octets, effectively using character encoding as a type of compression. Other encoding schemes stored a base character followed by additional characters representing accents or other marks needed when writing in a different language, although these were largely abandoned in favor of the simpler one-to-one encoding methods.
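The packing idea can be sketched as follows. This is an illustrative implementation only: the function names `pack_7bit` and `unpack_7bit` are hypothetical, and the bit order used here (most significant bit first, zero padding at the end) is an assumption that does not necessarily match any particular historical scheme.

```python
def pack_7bit(text: str) -> bytes:
    """Pack 7-bit character codes back to back into 8-bit octets."""
    bits = 0    # bit accumulator
    nbits = 0   # number of valid bits currently in the accumulator
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code > 0x7F:
            raise ValueError(f"{ch!r} does not fit in 7 bits")
        bits = (bits << 7) | code
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
        bits &= (1 << nbits) - 1  # keep only the bits not yet written
    if nbits:
        # Pad the final partial octet with zero bits on the right.
        out.append((bits << (8 - nbits)) & 0xFF)
    return bytes(out)


def unpack_7bit(data: bytes, count: int) -> str:
    """Recover `count` characters from octets produced by pack_7bit."""
    bits = 0
    nbits = 0
    chars = []
    for byte in data:
        bits = (bits << 8) | byte
        nbits += 8
        while nbits >= 7 and len(chars) < count:
            nbits -= 7
            chars.append(chr((bits >> nbits) & 0x7F))
        bits &= (1 << nbits) - 1
    return "".join(chars)


packed = pack_7bit("ABCDEFGH")
print(len(packed))             # 7 octets for 8 characters
print(unpack_7bit(packed, 8))  # ABCDEFGH
```

The character count is passed to `unpack_7bit` explicitly because the padding bits at the end would otherwise be indistinguishable from a trailing character.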

In HTML documents, character encoding is roughly the same as the broader concept, except that the declared encoding covers an entire set of characters. This can be important not only for languages other than English, but also for documents that use specific scientific or mathematical symbols that are not present in all character sets. It can also be useful for punctuation and other glyphs that might not be present, or that are mapped differently, across encoding schemes. Documents that do not properly declare a non-standard character encoding could display incorrectly or be filled with nonsensical characters and placeholders instead of readable information.
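The effect of a missing or wrong encoding declaration can be reproduced in a few lines of Python. In this sketch, `declared_charset` is a hypothetical helper that uses a deliberately simple regular expression to read the charset from a `<meta>` tag; real browsers follow a more involved detection procedure, so treat it as an approximation only.

```python
import re
from typing import Optional


def declared_charset(html: str) -> Optional[str]:
    """Return the charset named in a <meta> tag, or None if none is found."""
    match = re.search(r'<meta[^>]*charset=["\']?([\w-]+)', html, re.IGNORECASE)
    return match.group(1) if match else None


page = '<html><head><meta charset="utf-8"></head><body>café</body></html>'
raw = page.encode("utf-8")  # the bytes as they would be stored or sent

print(declared_charset(page))  # utf-8

# Decoding with the declared encoding restores the text.
print("café" in raw.decode("utf-8"))  # True

# Decoding the same bytes with the wrong encoding yields nonsense characters.
garbled = "café".encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# A decoder that cannot map a byte at all may substitute placeholders.
placeholders = "café".encode("utf-8").decode("ascii", errors="replace")
print(placeholders)  # caf followed by two U+FFFD replacement characters
```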