JOURNAL OF MULTIMEDIA (JMM)
ISSN : 1796-2048
Volume : 2    Issue : 6    Date : November 2007

On Separation of English Numerals from Multilingual Document Images
Basanna V.Dhandra and Mallikarjun Hangarge
Page(s): 26-33


Abstract
For Optical Character Recognition (OCR) of bilingual or multilingual documents containing text words
in a regional language and numerals in English, it is necessary to identify the different script forms
before running a script-specific OCR on each. In this paper, an attempt is made to separate English
numerals at word level from bilingual and trilingual documents containing Kannada, Devnagari,
Tamil, Odiya and Malayalam scripts, using discriminating features such as aspect ratio, stroke
densities and eccentricity. The k-nearest neighbour algorithm is used to classify new
word images, and the algorithm is tested on 6000 sample words with five-fold cross-validation.
The algorithm is robust with respect to font styles, sizes and noise. The results obtained are
quite encouraging.
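
The classification step described above can be illustrated with a minimal sketch in Python. It is not the authors' implementation. It assumes each word image has already been reduced to a short numeric feature vector (for example aspect ratio, a stroke density and eccentricity), and it uses synthetic toy data in place of the 6000-word data set. The function names, the choice of k = 3, the Euclidean distance and the class centres are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Return the majority label among the k training vectors closest to x."""
    # Rank training samples by Euclidean distance to the query vector.
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cross_validate(X, y, k=3, folds=5, seed=0):
    """Mean k-NN accuracy over `folds` disjoint held-out folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    accuracies = []
    for f in range(folds):
        # Every `folds`-th shuffled index forms the held-out fold.
        test_idx = idx[f::folds]
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx
        )
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / folds


def make_toy_data(n_per_class=50, seed=1):
    """Synthetic (aspect ratio, stroke density, eccentricity) vectors.

    The class centres are invented for illustration only; they are not
    measurements from the paper.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label, centre in (("numeral", (1.0, 0.3, 0.5)), ("text", (3.0, 0.6, 0.9))):
        for _ in range(n_per_class):
            X.append(tuple(c + rng.gauss(0, 0.05) for c in centre))
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print(f"mean 5-fold accuracy: {cross_validate(X, y):.3f}")
```

On real word images the features would probably need to be scaled to comparable ranges before distances are computed, since a feature with a large numeric range would otherwise dominate the Euclidean distance.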

Index Terms
Script identification, OCR, morphological reconstruction, eccentricity, cross-validation