Indic-Computing Handbook

18.2 Language Information

Each chapter covering a language has the following proposed structure:

Background Information

General information about the language that would be of interest to developers. This would include brief descriptions of the following:

The history of the language.
The language family this language belongs to and how this language is related to the other languages in its family.
The number of speakers of the language, and their geographic distribution.
For languages that are widely dispersed, the cultural differences between regions that a programmer should be on the lookout for.
Short descriptions of the major variants and dialects of the language.
A mention of the primary script that this language is written in, with a link to the appropriate Handbook chapter for that script. If the language is written using multiple scripts, this fact should be mentioned.
If tutorials and beginners' guides for this language are available on the Internet, these could be briefly mentioned in this section. The reference section would then contain further information about accessing these tutorials and guides.

Linguistic Information

An introduction to the linguistic characteristics of this language, focussing on issues in natural language processing, corpus linguistics, grammatical analysis and visual display.

Character set encodings

Some languages are supported by more than one character set encoding and some character set encodings can express more than one language.

Each character set encoding that could be used to encode this language would be mentioned here, with a link to the appropriate chapter in the Handbook that examines the character set encoding in detail.

The plus points and drawbacks of each character set encoding used to encode this language would then be described. For example, a character set encoding may not cover all the characters in the primary script for this language, or may have inappropriate semantics for a few characters. Such issues would be highlighted here.
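
As a rough illustration of the coverage problem described above, the following sketch lists the characters of a sample string that a given encoding cannot represent. It relies only on Python's built-in codecs. The helper name `uncovered_chars` and the sample text are hypothetical, chosen purely for illustration, and Latin-1 stands in for any encoding that lacks the script.

```python
def uncovered_chars(text, encoding):
    """Return the distinct characters of `text` that `encoding` cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # The codec has no representation for this character.
            if ch not in missing:
                missing.append(ch)
    return missing


# Hypothetical sample: a short Devanagari word, written with escapes.
sample = "\u0928\u092e\u0938\u094d\u0924\u0947"

print(uncovered_chars(sample, "utf-8"))    # characters UTF-8 cannot encode
print(uncovered_chars(sample, "latin-1"))  # characters Latin-1 cannot encode
```

A check of this kind only reports missing code points. It says nothing about characters that are present but carry inappropriate semantics, which would still need to be documented by hand.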

Input methods

Describe the available input methods for this language. For example, a language may be supported by:

One or more "standard" keyboard layouts. These would be covered in greater detail in the chapter on the corresponding script, but any language-specific quirks would also be mentioned here.
Voice recognition interfaces, if available.
Handwriting recognition interfaces, if available.

For each input method, we would list the output character encodings supported by it.

We would also cover any issues in providing WYSIWYG input methods for this language.

Text processing

This section would introduce the issues involved in text processing for this language:

The different ways the language can be sorted.
Searching and matching semantics: what it means for one word to match another. For example, some character sets support alternate ways to encode the same string due to the presence of "compatibility" characters.
Morphological analysis on the language: how word roots are determined, prefix and suffix rules.
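
The matching issue above can be sketched with Unicode normalization, assuming the text is held as Unicode. This is one possible approach, not something the Handbook prescribes. The helper name `same_text` is hypothetical. The example pair is the Devanagari letter QA, which can be written either as one code point or as KA followed by NUKTA.

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after applying the same Unicode normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


precomposed = "\u0958"       # DEVANAGARI LETTER QA as a single code point
decomposed = "\u0915\u093c"  # DEVANAGARI LETTER KA + DEVANAGARI SIGN NUKTA

print(precomposed == decomposed)           # raw code-point comparison
print(same_text(precomposed, decomposed))  # comparison after normalization
```

The two spellings differ as raw code-point sequences but compare equal once both are normalized. Passing `form="NFKC"` would additionally fold compatibility characters, at the cost of discarding distinctions that some applications may want to keep.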

XXX: to be determined whether line-breaking rules are script dependent, or if they are language specific.

Locale information

Locale definitions generally depend on both the region and the character set encoding. This chapter would, at a minimum, describe the contents of a "popular" locale definition, i.e., one that uses the primary character set encoding and targets a region where computing in the language is widespread.

This section would include information about:

Time and date handling: the calendars in use, how clock time is expressed and the like.
Measures of physical quantities.
Currencies and typographical conventions for monetary quantities.
Numeric systems in use (some regions use old, non-metric numeric systems).
Salutations, typical message strings that frequently arise during application programming, etc.

Region-specific differences from this "popular" locale definition would be highlighted here.
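
As one hedged example of a locale-dependent numeric convention, the sketch below groups digits in the lakh/crore style used in parts of South Asia: the last three digits form one group, and the remaining digits are grouped in pairs. Whether this is among the systems a given language chapter would need to cover is an assumption, and the function name `group_indian` is hypothetical.

```python
def group_indian(n):
    """Format a non-negative integer with lakh/crore style digit grouping."""
    digits = str(n)
    head, tail = digits[:-3], digits[-3:]  # the last three digits stay together
    groups = []
    while len(head) > 2:
        groups.insert(0, head[-2:])        # remaining digits are taken in pairs
        head = head[:-2]
    if head:
        groups.insert(0, head)             # leftover one or two leading digits
    groups.append(tail)
    return ",".join(groups)


print(group_indian(12345678))
print(group_indian(100000))
```

In practice an application would usually take the grouping rule from the platform's locale data rather than hard-code it. The sketch only shows the kind of rule a locale definition has to capture.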

Typographical conventions

Typographical conventions used when rendering text for this language:

Line breaking rules.
Rules for paragraph formation and text justification.
Rules for fine typography: ligature formation, hyphenation and the like.

Some of these typographical rules could be region-specific; such rules would be highlighted here.

Language specific references

The last section of each language chapter would contain a list of references for people interested in understanding this language in greater depth. Examples of such references would include:

Links to the appropriate portions of the Indic-Computing Technology Map where software relevant to this language is covered.
Key organizations promoting computerization of this language.
Language resources (for example, dictionaries) available on the Internet in electronic form.
Online guides and tutorials for learners of the language.