Abstract:
Modi [moːɖiː] is an ancient script that is not on the list of recognized official scripts for Indian languages, and relatively little research has been done on recognizing handwritten Modi characters compared with other Indian scripts. Character recognition in Modi script is difficult because the writing is cursive, continuous and unconstrained, and many characters have strikingly similar shapes. Further difficulties include segmentation, noise and degradation, varying skew, variations in illumination, uneven alignment, and slanting, overlapping and touching lines. Word-level segmentation or recognition is ineffective for Modi documents because, unlike other scripts, they contain no symbols marking the end of a word or sentence. Another problem is the lack of a dataset covering most of the syllables required to automate the transcription of Modi documents. Previous work on automatic Modi character recognition used datasets of isolated handwritten characters, i.e. vowels, consonants and numerals, and did not include consonants with vowel diacritics or conjunct consonants. In 2020, word transcription of Modi script to Devanagari was reported, but it considered only 57 Modi character classes, which are too few to capture the script's characters. We require a dataset that includes vowels, consonants, each consonant with the vowel diacritics, and conjunct consonants in order to cover a wide variety of Modi syllables. This calls for examining different approaches to recognizing Modi documents and for making their content available in a widely known script such as Devanagari. This paper presents a model that recognizes Modi text in an input image and produces its transcription in the Devanagari script.
In this work, we have also created a dataset that includes Modi vowels, consonants, numerals, consonants with vowel diacritics and conjunct consonants. The dataset consists of text in Modi and its transcription in Devanagari. Our proposed model (ModiDev_LSTM_Model), which transcribes Modi documents to Devanagari using LSTM neural networks, achieved an encouraging character accuracy of 94.67 percent. A detailed analysis of the substitution errors made by the ModiDev_LSTM_Model showed that there are seven types of error, namely 'Anusuvar' (Bindu), 'Eekar', 'Ookar', 'Ardhacandra', 'Matra', 'Aa' and 'other'. Among these, 'Anusuvar' accounted for the highest percentage of substitution errors and 'Aa' for the lowest.