Character Encodings
The first step in enabling a computer to process a human language is to create a mapping between the kind of symbol manipulation that a computer is capable of and the desired text manipulation that we humans want.
For languages with a written script, this process starts with identifying the characteristics of the script, its constituent elements and the methods by which the constituent elements get put together to form larger entities like syllables, words, and sentences.[1]
In this chapter we will look at the process of mapping these elements of a script to a form where the symbolic processing powers of the computer can be used. After reading this chapter you should be able to:
Understand what the terms ``character'' and ``character set'' mean in the context of software.
Understand the difference between the internal representation of text inside a computer and the visible artifacts presented to human readers.
Understand what ``character set encodings'' are, and be able to name a few character set encodings covering scripts used worldwide.
Understand how having standard character set encodings helps in boosting interoperability.
A character can be (rather loosely) defined as a basic element of a script.[2] In other words, characters are the elements from which larger textual units like words are created.
An important point here is that a character is not the same as a glyph in a font. A glyph is a specific visual shape that represents a character. The same character could be rendered using many different glyphs (see Figure 2-2). One way of understanding the difference is that a character is an abstract quantity which could be represented by one glyph or a combination of glyphs in a font.
The first step in representing a script on a computer is to list all the valid characters that comprise a script. For characters that are in common use, this is straightforward, but for scripts and languages that have been around for centuries, there will be historical characters that need to be either included in or excluded from our list of valid characters. Similarly, there could be variants of the same basic character in use in different regions; we need to decide whether such variants deserve to be treated separately or whether they can be lumped together as one character.
Most scripts have distinct classes of characters with distinct meanings. Some characters are used to separate words (i.e., punctuation), some denote numbers, some denote vowels, and some denote consonants.
Indic scripts tend to have more kinds of characters.
Some script elements serve the role of intonation modifiers: for example, they indicate that a vowel sound has to be prolonged.
Other script elements are modifiers, meaning that they do not occur by themselves, but only in conjunction with another, more primary character.
Some Indic scripts have two classes of otherwise phonetically similar characters: one class has characters which can combine with other characters, and the other comprises characters that always occur in uncombined form.
Deciding the set of characters of a script and their semantic properties requires linguistic expertise. Once this list is prepared, we refer to it as a character set[3].
A character set has the following properties:
It contains all the distinct elements of the script that are of interest to us.
It does not contain trivial variations of a character (say, alternate visual forms). Each character in a character set has a distinct semantic meaning.
All the different semantic classes of characters of the script are identified and represented.
After the composition of a character set has been decided, we can then represent it in a form suitable for processing in a digital computer. Computers, you may recall from Section 2.1, process symbols or bit patterns. The process of representing the characters of a script involves mapping them to bit-patterns that a computer can then manipulate.
For example, we could choose the bit pattern 01100001 to represent the character 'a' (LATIN SMALL LETTER A) in the Roman script[4]. Since it is not very convenient to use binary strings, we normally write these bit-patterns in numeric form: 01100001 is written as decimal 97 or as hexadecimal 0x61.
The mapping between a bit pattern used and the character it represents is called a character encoding. The mapping for all the characters in a character set is similarly called a character set encoding[5]. The numeric value of each bit pattern is termed its code point.
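This mapping can be inspected directly in most programming languages. As a small sketch in Python, whose built-in ord and chr functions convert between a character and its code point (for 'a', the Unicode code point agrees with the ASCII value used above):

```python
# Convert between a character and its code point, and back again.
cp = ord('a')             # character -> code point
print(cp)                 # 97, the decimal form
print(hex(cp))            # 0x61, the hexadecimal form
print(format(cp, '08b'))  # 01100001, the bit pattern from the text
print(chr(97))            # code point -> character: a
```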
Now, if we can get industry wide acceptance of our character encoding scheme, then it would be possible for us to transfer data from one computer to another without loss of information, since both the sender and recipient would have agreed on the mappings between the bit-patterns in the data and the characters they are to represent. A widely accepted character encoding is called a standard encoding.
The important thing to note here is that encodings are just conventions. For example, decimal 97 (binary 01100001) denotes the character 'a' in the ASCII character set encoding and the character '/' (SOLIDUS) in the IBM EBCDIC character set encoding.
Further, a given language can have more than one encoding. The Chinese language, for example, has multiple encodings in widespread use, such as Big5 and EUC.
Why would a script have multiple encodings? A given script may have multiple encodings for many reasons.
Perhaps the most familiar encoding in the English speaking world is ASCII. ASCII encodes, using 7 bits, the Roman alphabet (upper case and lower case) and adds a small number of special symbols and punctuation. There is no provision in ASCII for the accented characters common in European languages.
Since ASCII does not satisfy the needs of the European languages, the International Organization for Standardization (ISO) created a series of extended encodings. The ISO Latin-1 encoding, for example, is an 8 bit encoding. It is backwards compatible with 7 bit ASCII and, in addition, defines the meanings of the remaining code points to include characters and punctuation used by several Western European languages.
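The backwards compatibility can be checked exhaustively. A sketch in Python, using its 'ascii' and 'latin-1' codecs:

```python
# ISO Latin-1 agrees with ASCII on every code point in 0-127 ...
for cp in range(128):
    assert bytes([cp]).decode('ascii') == bytes([cp]).decode('latin-1')

# ... and uses the remaining 128 code points for accented characters.
print('é'.encode('latin-1'))   # b'\xe9': one byte, above the ASCII range
try:
    'é'.encode('ascii')
except UnicodeEncodeError:
    print('ASCII has no code point for é')
```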
7 bit ASCII was found restrictive even by US manufacturers when they looked to expand into markets outside the US. Large manufacturers like IBM defined their own encodings, and these are still in use today. As an example, the encoding used by the IBM PC console is named Code Page 437 and contains line drawing and graphic characters in addition to the regular ASCII character set. IBM defined code pages for a number of non-US language environments. Similarly, Microsoft has defined its own encodings for its Windows™ product line. These differ from the IBM encodings even though both sets of encodings originally targeted the same PC platform.
Early attempts at standardizing character encodings for Indian languages included the ISCII standard.
Clearly, having so many encodings makes it difficult for software to inter-operate and share data. The Unicode consortium set out to encode a ``master list'' of characters in their Unicode standard. Unicode aims to represent every character used in the world's written languages. In this character set encoding standard, every character in use in the world is given a unique encoding, substantially simplifying the task of creating truly international software. Many software vendors now support Unicode. The Unicode standard is covered in greater detail in Chapter 10.
There are a few more decisions that we need to take before we can represent languages on a computer. Standard encodings define the mapping from the characters of a script to the binary symbols inside a computer used to represent them. There are many ways this mapping can be achieved. We present a short taxonomy here.
Standard ASCII has only 128 unique code points, so each ``character'' defined by ASCII can be represented with 7 bits. Most computers today use 8 bit bytes, so each code point defined by ASCII can easily be represented using a byte's worth of memory.
Larger character sets (for example, those used for Chinese, Vietnamese, Japanese and Korean) have many thousands of characters. Since an 8 bit byte can only represent 256 code points, we need more space to represent these languages.
Wide character encodings. Wide character encodings use correspondingly larger quantities to hold the code points of the encoding being used. For example, a 32 bit quantity can represent 4294967296 unique code points.
Multi-byte encodings. Multi-byte encodings use a sequence of byte sized quantities to represent a code point that cannot be represented in 8 bits.
Figure 2-3 shows the same Unicode code point represented as a wide character and in multi-byte form.
Wide character encodings are convenient to process: each code point uses exactly the same number of bits in its representation. The width of a wide character needs to be large enough to contain the largest character code. In practice, implementations choose this width to be a natural word size for the computer architecture, say 32 bits. This implies that a wide character representation can be wasteful in its use of computer memory.
Multi-byte encodings, on the other hand, can be more space efficient, but at the cost of increased processing complexity. Since each ``chunk'' in a multi-byte encoding is typically 8 bits wide, existing software that is geared to handling 8 bit wide chunks can generally be ported to handle multi-byte encodings with relative ease. Some multi-byte encodings try to be compatible with C language specifications: specifically, they ensure that a NULL code point never occurs in the middle of a multi-byte encoded character.
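The trade-off can be sketched in Python, using Unicode's UTF-32 form as a stand-in for a wide character encoding and UTF-8 as a stand-in for a multi-byte encoding:

```python
# Compare a fixed-width (wide) representation with a multi-byte one
# for the same code points.
for ch in ['a', 'é', '\u0915']:       # U+0915 is DEVANAGARI LETTER KA
    wide = ch.encode('utf-32-be')     # always 4 bytes per code point
    multi = ch.encode('utf-8')        # 1 to 4 bytes per code point
    print(len(wide), len(multi), 0 in wide, 0 in multi)

# Every wide form above contains zero bytes, while the UTF-8 forms
# never do -- which is what keeps a multi-byte encoding like UTF-8
# safe for C's NUL-terminated strings.
```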
Chapter 28 contains more information about the programming interfaces used to manipulate character data.
Encodings can be stateful or stateless. In a stateful encoding, the meaning of a symbol is dependent on both its value and on the symbols already encountered. For example, the ISO-2022-JP encoding used to encode the Japanese language groups its characters into multiple sets, with each set containing fewer than 256 symbols. Special ``escape symbols'' embedded in the data stream tell the application to switch between the appropriate sets.
Stateless encodings are designed so that each numeric value retains the same meaning irrespective of the symbols that precede it in a data stream. UCS-2 is an example of a stateless encoding.
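The difference is visible in the encoded bytes themselves. A sketch in Python, contrasting the stateful ISO-2022-JP encoding with the stateless UTF-8 (UTF-8 is used here as the stateless example because Python exposes it directly; UCS-2 behaves analogously):

```python
# A stateful encoding embeds escape sequences that switch character sets.
encoded = '\u3042'.encode('iso2022_jp')   # HIRAGANA LETTER A
print(encoded)                            # b'\x1b$B$"\x1b(B'
print(0x1b in encoded)                    # True: ESC bytes switch sets

# A stateless encoding of the same character carries no embedded state:
print('\u3042'.encode('utf-8'))           # b'\xe3\x81\x82'
```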
Sometimes a character set encoding standard describes more about a character set than just the mapping of internal symbols to script characters. Encoding standards like Unicode define additional properties for each character they encode.
Nearly every character set encoding in existence also encodes a few control and non-printing characters. For example, the ASCII character set standard uses the code point range 00--31 for ``control'' characters.
Characters generally have the following kinds of properties:
Characters which are letters are usually used to write words of a language.
Mirrored characters come in pairs: for example, '(' and ')' (LEFT PARENTHESIS and RIGHT PARENTHESIS respectively) form such a mirrored pair.
Numeric characters represent numbers. For example, '0' and '1' represent the numeric quantities zero and one respectively.
Combining characters interact with their neighbouring characters when being displayed, resulting in a composite character shape being selected for display. Combining characters are very common in Indic scripts.
Some scripts support the concept of ``case'': for example, A and a are considered to be the same letter in Latin languages, varying only in case. Indic scripts tend to be caseless.
Some characters may be defined to have special properties, for example:
Characters performing line spacing and line boundary control (e.g., EN QUAD and ZERO WIDTH SPACE).
Characters controlling hyphenation (e.g., HYPHENATION POINT).
Characters intended for usage with mathematics (e.g., FRACTION SLASH).
Characters used for bi-directional ordering (e.g., LEFT-TO-RIGHT MARK).
... and others ...
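The property classes above can be queried programmatically. A sketch using Python's standard unicodedata module, which exposes the per-character properties defined by the Unicode standard:

```python
import unicodedata

print(unicodedata.category('a'))        # Ll  -- letter, lowercase
print('a'.upper())                      # A   -- scripts with case
print(unicodedata.mirrored('('))        # 1   -- part of a mirrored pair
print(unicodedata.numeric('1'))         # 1.0 -- a numeric character
print(unicodedata.category('\u093e'))   # Mc  -- a combining (spacing) mark:
                                        # DEVANAGARI VOWEL SIGN AA
```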
[1]
The vast majority of languages used in the Indian subcontinent do not have associated written scripts; they are generally written using one or more of the ``popular'' scripts in their region of use. A few languages are purely auditory and are only spoken, never written. We do not have widely accepted ways to process such languages.
[2]
Trying to make this definition more precise turns out to be rather hard. The precise meaning of basic turns out to be dependent on the text processes involved. The set of smallest distinguishable script elements for one kind of text process could differ from that for another kind of process.
[3]
A character set is not to be confused with the term charset used in X Window System literature.
[4]
The same bit pattern represents the character '/' (SOLIDUS) on computers that follow the older IBM EBCDIC character set.
[5]
The term code page is also used in older literature.
This, and other project documentation, can be downloaded from [http://indic-computing.sourceforge.net/documentation.html].
Copyright © 2001--2009 The Indic-Computing Project. Contact: jkoshy