Tutorial: Computers and Indian Languages
Caution: This tutorial is in the process of being rewritten and revised. It has been published on our site to facilitate reviews--it is not yet at a stage where it is ready for public consumption. Please pardon the errors and please help us improve the tutorial.
In this tutorial we will take a peek ``under the hood'' and examine how computers actually process text. We will talk about character encodings, input methods, fonts, locales and many other things that the casual user is not aware of when using computers.
However, these concepts are very important for software developers wanting to add support for new languages to computers.
This tutorial covers the basics; the special issues that come up when processing Indian scripts and languages get detailed coverage in other parts of the Handbook.
After reading and digesting this tutorial, you should be able to:
Understand how computers process text and know what the term character set means.
Understand the importance of standard character set encodings for data exchange.
Be able to distinguish between fonts and glyphs, character set encodings and font encodings.
Know what the terms code point, code page, character set and character set encoding mean, and how they differ from each other.
Understand what an input method is, and be able to describe a few common input methods.
After reading and digesting this tutorial, you will be in a position to use the rest of the material in this Handbook effectively.
The tutorial is organized as follows:
Section 2.1 introduces you to the kinds of processing that computers do on text.
Section 2.2 covers the basic issues involved in representing a human language on a computer. It introduces you to the concept of character sets and shows how character sets can be encoded for processing by a computer.
Section 2.3 covers the basics of fonts and font technologies.
Section 2.4 examines the way humans and computers interact. We discuss the ways data is input to a computer and how data is presented for human consumption.
Section 2.5 covers the basics of program customization for different regions and cultural traditions.
Computers are introduced to neophytes as ``number crunching machines'', as devices related to the ancient abacus, only more electronic. Yet, for the most part, the modern computer spends an insignificant portion of its time doing numeric calculation. Instead, the great utility of computers comes from the fact that they are excellent symbol manipulation machines.
To a computer, human readable text is merely a sequence of symbols; symbols that it can process and transform at blinding speed, in the precise manner written down in the software program that it is executing.
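A minimal sketch of this idea in Python: the Hindi word chosen here is purely illustrative, but it shows that what a human reads as text is, to the computer, nothing but a sequence of numbers (Unicode code points).

```python
# To the computer, the word "भारत" (Bhārat) is just numbers.
text = "भारत"
code_points = [hex(ord(ch)) for ch in text]
print(code_points)  # ['0x92d', '0x93e', '0x930', '0x924']
```

Every transformation described below (rendering, searching, sorting, compression) is ultimately arithmetic on sequences of numbers like these.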
Here is a list of a few typical text transformation processes that computers are called on to do:
They render text into readable ``characters'' either on a display device or onto a printed page. We expect computers to do this correctly, taking into account complex typographical features like ligatures and context dependent rendering (see the sidebar Ligatures and context dependent rendering).
We expect the computer to be able to change the rendered appearance of the text: for example, we may want a particular part of the displayed text to be rendered with a different weight (bold, medium, etc.) or with a different font entirely.
Computers take in input from a keyboard or other input device. For some scripts and languages this process is straightforward: a user would type at a keyboard that possesses one key per ``character'' of the script.
However, some scripts are so complex that this simple approach is out of the question. Indic and Far-Eastern scripts require the computer to give considerable assistance to the user keying in data. For example, Han ideograms (Chinese Hanzi, Japanese Kanji) are input by having the user select the required character from thousands of alternatives.
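One common form of such assistance is a transliteration-based input method, where the user types romanized keystrokes and the computer converts them to native-script characters. The sketch below uses a tiny, made-up key mapping (not any real input method) with greedy longest-match conversion:

```python
# Toy transliteration input method. The mapping is illustrative,
# not taken from any real IME or standard keyboard layout.
KEYMAP = {"ka": "\u0915", "kha": "\u0916", "ga": "\u0917", "i": "\u093F"}

def transliterate(keys: str) -> str:
    out, i = "", 0
    while i < len(keys):
        for length in (3, 2, 1):            # try longest match first
            chunk = keys[i:i + length]
            if chunk in KEYMAP:
                out += KEYMAP[chunk]
                i += length
                break
        else:
            out += keys[i]                  # pass unknown keys through
            i += 1
    return out

print(transliterate("kakha"))  # कख
```

Real input methods are far more elaborate (candidate lists, context-sensitive conversion, user dictionaries), but the core task is the same: mapping keystroke sequences to script characters.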
Computers are expected to assist in selecting and editing portions of the displayed text.
The end-user expects this process to be straightforward; in typical graphical user interfaces regions of text are manipulated using mouse gestures (clicking, dragging, etc.).
However, the process of mapping between the display and the underlying characters is not straightforward for some scripts. The Indic scripts, for example, use a visual ordering that differs from the order in which text is represented internally in the computer. For such scripts the mapping between the internal representation and the visual presentation can be quite complex.
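A concrete instance of this mismatch: in the Devanagari syllable ``कि'' (ki), the short-i vowel sign is stored after the consonant in memory but is drawn to the left of it on screen. The snippet below shows the internal (logical) storage order:

```python
import unicodedata

# "कि" is stored consonant-first, yet the vowel sign renders
# to the LEFT of the consonant. Storage order != visual order.
syllable = "\u0915\u093F"   # कि
for ch in syllable:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0915 DEVANAGARI LETTER KA
# U+093F DEVANAGARI VOWEL SIGN I
```

Editing operations such as cursor movement and backspace must therefore translate between the user's visual expectations and this logical order.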
Computers are expected to know how to determine what constitutes a ``word'', a ``sentence'', a ``stanza'' or ``paragraph'' in the text.
We expect the computer to be able to analyze the morphological structure of text: for example, extract word roots, stems and suffixes, perform grammar checks on sentences and the like.
This requirement implies that the controlling software needs to know about the linguistic properties of the text elements that it is manipulating.
Similarly, we expect software to be able to break lines correctly, and to hyphenate words correctly.
Some languages, such as Sanskrit written in the Devanagari script, tend to have very long ``words''. For such scripts, the rules that determine where a word can be hyphenated can be quite complex (see Figure 2-1).
Computers are expected to easily sort, filter, and search textual data; indeed, much of the value of information technology comes from the ability of computers to do these tasks well.
These functions are not always easy to implement correctly as some scripts in use in the world today have multiple conflicting ways of being represented inside the computer.
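Even within a single standard such as Unicode, the same visible character can have more than one internal representation. For example, the Devanagari letter ``क़'' (qa) exists both as a single precomposed code point and as KA followed by a combining nukta; a naive byte-for-byte comparison treats the two as different. Unicode normalization (here via Python's standard `unicodedata` module) is one way to reconcile them:

```python
import unicodedata

qa_precomposed = "\u0958"           # क़ as one code point
qa_decomposed  = "\u0915\u093C"     # KA + NUKTA: looks identical

# A naive comparison fails even though both render the same:
print(qa_precomposed == qa_decomposed)   # False

# Normalizing to a common form (NFD) makes them comparable:
nfd = unicodedata.normalize("NFD", qa_precomposed)
print(nfd == qa_decomposed)              # True
```

Software that sorts or searches text must normalize representations like these, or identical-looking strings will silently fail to match.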
Computers are expected to be able to compress text or otherwise prepare it for transmission, archival and other kinds of processing. Further we expect to be able to recover the original text from the transformed version without error.
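This round-trip requirement — compress, then recover the original exactly — can be sketched with Python's standard `zlib` module (the sample text is merely illustrative):

```python
import zlib

# Lossless round trip: compress, decompress, verify byte equality.
original = "\u0938\u0902\u0938\u094D\u0915\u0943\u0924\u092E\u094D " * 100
compressed = zlib.compress(original.encode("utf-8"))
restored = zlib.decompress(compressed).decode("utf-8")

assert restored == original   # no information was lost
print(len(original.encode("utf-8")), "->", len(compressed), "bytes")
```

Note that compression operates on the encoded bytes, not on characters, so a well-defined character encoding is a prerequisite for this kind of processing.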
Clearly, these diverse tasks are possible only if the underlying software architecture has been designed to enable them in the first place. In subsequent sections of this tutorial we will examine the foundations on which computer processing of text is built.
This, and other project documentation, can be downloaded from [ http://indic-computing.sourceforge.net/documentation.html ].
Copyright © 2001--2009 The Indic-Computing Project. Contact: jkoshy