Abstract:
In this paper, we present an empirical study of word-level language identification for Konkani-English Code-Mixed Social Media Text (CMST). We describe a new dataset of thousands of Facebook posts that exhibit code-mixing between Konkani and English. To the best of our knowledge, this work is the first attempt to create a linguistic resource for this language pair, which will be made public, and to develop a language identification system for it. Using this tagged Konkani-English dataset, we carried out experiments on word-level language detection. We approached the task in several ways: an unsupervised dictionary-based detection technique, and supervised word-level language identification, both as sequence labelling with Conditional Random Field (CRF) based models and with SVM and Random Forest classifiers.
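
To make the unsupervised dictionary-based approach concrete, the following is a minimal sketch of word-level tagging by dictionary lookup. It is an illustration under stated assumptions, not the system described in this paper: the word lists, the tag names (`kok`, `en`, `ambiguous`, `unknown`, `other`), and the functions `tag_word` and `tag_post` are all hypothetical.

```python
# Toy word lists for illustration only (assumption): the dictionaries
# actually used in the paper are not reproduced here.
KONKANI_WORDS = {"hanv", "tum", "borem", "kitem", "asa"}
ENGLISH_WORDS = {"i", "you", "good", "what", "is", "the", "movie"}

PUNCTUATION = ".,!?;:\"'()"


def tag_word(word):
    """Assign a language tag to one token by dictionary lookup."""
    # Normalise case and strip surrounding punctuation before lookup.
    token = word.lower().strip(PUNCTUATION)
    if not token:
        return "other"  # token was punctuation only
    in_kok = token in KONKANI_WORDS
    in_en = token in ENGLISH_WORDS
    if in_kok and in_en:
        return "ambiguous"  # found in both dictionaries
    if in_kok:
        return "kok"
    if in_en:
        return "en"
    return "unknown"  # found in neither dictionary


def tag_post(text):
    """Tag every whitespace-separated token of a post."""
    return [(word, tag_word(word)) for word in text.split()]


if __name__ == "__main__":
    for word, tag in tag_post("Movie borem asa!"):
        print(word, tag)
```

A lookup of this kind ignores context, so words found in both dictionaries or in neither remain unresolved; the supervised sequence-labelling models can use neighbouring words to decide such cases.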