Problems with the standard
The Unicode standard has seen a fair degree of criticism from Indian linguists and researchers.
The standard defines code point 0904 DEVANAGARI LETTER SHORT A; however, some linguists dispute the existence of this character in the Devanagari script (see 1).
Marathi uses a grapheme that combines DEVANAGARI LETTER A with a CANDRA mark. This grapheme is missing from the Unicode standard for the Devanagari script, although a related grapheme, 090D DEVANAGARI LETTER CANDRA E, is present in the standard (see 1).
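The code points named above can be inspected with Python's standard `unicodedata` module. The following is only an illustrative sketch. It assumes that the "CANDRA mark" can be approximated by 0945 DEVANAGARI VOWEL SIGN CANDRA E; that choice, the helper `describe`, and the variable names are assumptions made for this example and are not taken from the standard or from the reference cited above.

```python
import unicodedata

# Code points named in the text above.
SHORT_A = "\u0904"    # DEVANAGARI LETTER SHORT A
LETTER_A = "\u0905"   # DEVANAGARI LETTER A
CANDRA_E = "\u090D"   # DEVANAGARI LETTER CANDRA E

# Assumption: the "CANDRA mark" is approximated here by the dependent
# vowel sign 0945. The text does not say which code point is meant.
CANDRA_SIGN = "\u0945"  # DEVANAGARI VOWEL SIGN CANDRA E


def describe(text):
    """Return (hex code point, Unicode name) pairs for each character."""
    return [(f"{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]


# Hypothetical two-code-point spelling of the Marathi grapheme.
marathi_candra_a = LETTER_A + CANDRA_SIGN

for code, name in describe(SHORT_A + CANDRA_E + marathi_candra_a):
    print(code, name)
```

Under this assumption the Marathi grapheme occupies two code points, whereas the related CANDRA E has a single code point of its own.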
The published policy of the Unicode Consortium is to disallow use of the 200D ZERO WIDTH JOINER (ZWJ) character to encode semantic differences. The original purpose of the ZWJ was to signal possible script ligation, so the underlying meaning of a sequence of Unicode characters was to be independent of the presence or absence of the ZWJ character inside it.
However, this published policy was violated for the Devanagari script: for this script, ZWJ was defined as encoding display variants of conjunct consonants. Encoding display variants was a major deviation from the display-independent nature of the Unicode standard.
Subsequently, for Indic scripts alone, the consortium chose to define the ZWJ character as (sometimes) causing a semantic distinction.
This implies that, for Indic scripts, two sequences of Unicode code points that are identical except for the presence of ZWJ code points could sometimes represent two different words and could at other times represent alternate display forms of the same word. This inconsistency makes processing Indic text difficult; see 1 for an example of the complications faced when implementing a Marathi spell checker.
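The difficulty can be sketched in a few lines of Python. The example below is illustrative only: the helpers `strip_zwj` and `differ_only_by_zwj` are hypothetical names introduced here, and the KA + VIRAMA + SSA sequence was chosen for this example rather than taken from the reference cited above.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER


def strip_zwj(text):
    """Remove every ZWJ code point from text (lossy if ZWJ carries meaning)."""
    return text.replace(ZWJ, "")


def differ_only_by_zwj(a, b):
    """True if a and b are different strings that become equal without ZWJ."""
    return a != b and strip_zwj(a) == strip_zwj(b)


# KA + VIRAMA + SSA, with and without a ZWJ after the VIRAMA.
conjunct = "\u0915\u094D\u0937"
with_zwj = "\u0915\u094D\u200D\u0937"

# A plain string comparison treats the two sequences as different.
print(conjunct == with_zwj)

# NFC normalization leaves the ZWJ in place, so it does not unify them.
print(unicodedata.normalize("NFC", with_zwj) == with_zwj)

# Only the ZWJ separates the two sequences.
print(differ_only_by_zwj(conjunct, with_zwj))
```

A tool such as a spell checker cannot tell from the code points alone whether to apply `strip_zwj` before a dictionary lookup: doing so is harmless when the ZWJ only selects a display variant, but it would merge distinct words wherever the ZWJ marks a semantic distinction.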
This, and other project documentation, can be downloaded from [ http://indic-computing.sourceforge.net/documentation.html ].
Copyright © 2001-2009 The Indic-Computing Project. Contact: jkoshy