The Telugu Script

Telugu is one of the most widely spoken languages in India and is one of the country's 22 official (scheduled) languages. It is also known as the "Italian of the East" and is considered easy to learn to speak and write.

The Telugu script is very complex for a machine to recognize, in other words, for an OCR system. The script has four classes of symbols that form words: (1) vowels, (2) consonants, (3) vowel modifiers (maatras), and (4) consonant modifiers (vatthus). A small example is shown below. Words are formed as a combination of the following.
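To make the four symbol classes concrete, here is a minimal sketch that labels the characters of an already-encoded Telugu word. It works on Unicode text, not on images, so it is not part of any OCR pipeline. The code point ranges come from the Unicode Telugu block; treating "virama + consonant" as a vatthu reflects how Unicode encodes conjuncts and is my assumption about how the classes map onto encoded text. The function name `classify` is hypothetical.

```python
# Code point ranges from the Unicode Telugu block (U+0C00 to U+0C7F).
# Assumption: these ranges approximate the four classes described above;
# a few rare or unassigned code points inside the ranges are ignored.
VOWELS = range(0x0C05, 0x0C15)      # independent vowels
CONSONANTS = range(0x0C15, 0x0C3A)  # consonants
MAATRAS = range(0x0C3E, 0x0C4D)     # dependent vowel signs (maatras)
VIRAMA = 0x0C4D                     # joins a consonant to a following vatthu


def classify(word):
    """Return a list of (text, label) pairs for an encoded Telugu word."""
    result = []
    i = 0
    while i < len(word):
        cp = ord(word[i])
        nxt = ord(word[i + 1]) if i + 1 < len(word) else None
        # In Unicode, a vatthu is encoded as virama followed by a consonant.
        if cp == VIRAMA and nxt is not None and nxt in CONSONANTS:
            result.append((word[i:i + 2], "vatthu"))
            i += 2
            continue
        if cp in VOWELS:
            label = "vowel"
        elif cp in CONSONANTS:
            label = "consonant"
        elif cp in MAATRAS:
            label = "maatra"
        else:
            label = "other"
        result.append((word[i], label))
        i += 1
    return result


# "telugu": ta + e-sign, la + u-sign, ga + u-sign
telugu = "\u0C24\u0C46\u0C32\u0C41\u0C17\u0C41"
# "amma": vowel a, consonant ma, then virama + ma (the vatthu)
amma = "\u0C05\u0C2E\u0C4D\u0C2E"

for text, label in classify(amma):
    print(repr(text), label)
```

In encoded text the classes are easy to separate by code point. The difficulty for OCR is that on the page the maatras and vatthus attach to, or merge with, the base consonant glyph.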

I researched an OCR system for printed, real-world Telugu document images (for example, scans from magazines and old printed material). Developing one was a formidable task, because Telugu orthography has characters and character combinations that are highly complicated and very close to one another statistically, structurally, and visually.

