MULTILINGUAL TEXT IMAGE DATABASE FOR OCR

D. N. HAKRO, A. Z. TALIB, G. N. MOJAI

Abstract


Optical Character Recognition (OCR) converts text in images into editable text and offers a fast alternative to manual data entry. OCR systems exist for most of the world's language scripts. Research on Latin-script OCR has matured, and research on cursive scripts is approaching the same level. Every OCR system, Latin or non-Latin, needs text images in its own language script for training, testing, validation, and error elimination. This paper presents a multilingual image database built from texts in multiple fonts collected from various sources. The database is multilingual, multi-font, multi-size, and multi-style, and is intended for use with most of the world's language scripts. The images were created with custom-built software that converts each text into word, single-line, multi-line, and paragraph images. Texts in 84 languages were used to create the images. The database includes translations of the Holy Quran in 42 languages, comprising more than 25,000 text pages, dictionaries of some other languages, and scanned images for languages such as Sindhi, Urdu, Pashto, and Arabic. This database can serve as a platform for establishing a standardized image database for OCR across the world's scripts. Its size can be increased by adding further texts and language scripts. The million words in the database can grow to billions and benefit more OCR researchers around the world. The database is freely available on request by email to the authors.
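As an illustration only, the following Python sketch shows one way the unit-level breakdown described in the abstract (word, single-line, multi-line, and paragraph images across fonts, sizes, and styles) might be enumerated before rendering. It is not the authors' software: the names RenderJob, split_units, and build_jobs, the wrap width, and the lines-per-block value are all hypothetical, and the sketch assumes whitespace-delimited text, which does not hold for every script.

```python
from dataclasses import dataclass
from itertools import product
import textwrap


@dataclass(frozen=True)
class RenderJob:
    """One image to be rendered (hypothetical structure, not from the paper)."""
    language: str
    unit: str   # "word", "line", "multiline", or "paragraph"
    text: str
    font: str
    size: int
    style: str


def split_units(paragraph, width=40, lines_per_block=2):
    """Split one paragraph into word, line, multi-line, and paragraph units.

    Assumes words are separated by whitespace; width and lines_per_block
    are illustrative defaults.
    """
    lines = textwrap.wrap(paragraph, width=width)
    blocks = ["\n".join(lines[i:i + lines_per_block])
              for i in range(0, len(lines), lines_per_block)]
    return {
        "word": paragraph.split(),
        "line": lines,
        "multiline": blocks,
        "paragraph": [paragraph],
    }


def build_jobs(language, paragraph, fonts, sizes, styles):
    """Pair every text unit with every font, size, and style combination."""
    units = split_units(paragraph)
    return [RenderJob(language, unit, text, font, size, style)
            for unit, texts in units.items()
            for text in texts
            for font, size, style in product(fonts, sizes, styles)]
```

Each RenderJob could then be passed to an image-rendering library of choice; the rendering step is deliberately left out because the abstract does not describe it.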


Copyright (c) 2016 Sindh University Research Journal - SURJ (Science Series)

 Copyright © University of Sindh, Jamshoro. 2017 All Rights Reserved.
Printing and Publication by: Sindh University Press.