Abstract:
Code-mixing is increasingly popular on social media, producing large volumes of noisy, sparse multilingual text in which the useful topics people discuss are highly dispersed. Moreover, the semantics are expressed through randomly occurring code-mixed words. In this paper, we propose code-mixed knowledge-based LDA (cmkLDA), which infers latent topic-based aspects from code-mixed social media data. We experimented on FIRE 2014, a code-mixed corpus, and showed that, with the help of semantic knowledge from a multilingual external knowledge base, cmkLDA learns coherent topic-based aspects across languages and achieves better topic interpretability and topic distinctiveness than the baseline models. These results agree with human judgment.