Abstract:
We study the automatic extraction of aspects, in the form of topic clusters, from code-mixed social media data. We present the necessary background and propose a code-mixed probabilistic topic model. Unlike the standard Latent Dirichlet Allocation (LDA) model, it replaces the distribution over words with a distribution over cross-lingual sets. This extends LDA to process code-mixed data and generate topic clusters by i) improving the relevance of aspect clusters by excluding insignificant words and ii) encouraging the inclusion of coherent words that are semantically related to each other. This is made possible by leveraging cross-lingual semantic information from BabelNet, a multilingual dictionary. We call the proposed model code-mixed semantic LDA (cms-LDA). Our results indicate that cms-LDA substantially improves the coherence of aspects in topic clusters compared with standard topic modeling counterparts. In our experiments, we compared the performance of the model on three forms of data: i) monolingual data, written in a single known language; and ii) code-mixed data with automatic language identification, together with monolingual cluster representations of the same.
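To illustrate the idea of modeling cross-lingual sets rather than surface words, the following is a minimal sketch and not the cms-LDA implementation. It assumes a hand-written toy lexicon (`TOY_LEXICON`) in place of BabelNet, romanized Hindi and English as an example language pair, and hypothetical helper names (`to_cross_lingual_sets`, `bag_of_sets`). Dropping tokens that have no lexicon entry is likewise an assumption about how insignificant words might be excluded.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for a multilingual resource such as
# BabelNet. Keys are surface tokens; values are illustrative set identifiers.
TOY_LEXICON = {
    "khana": "set:food",
    "food": "set:food",
    "accha": "set:good",
    "good": "set:good",
    "phone": "set:phone",
    "mobile": "set:phone",
}


def to_cross_lingual_sets(tokens, lexicon):
    """Map each token to its set identifier; drop tokens with no entry."""
    lowered = (tok.lower() for tok in tokens)
    return [lexicon[tok] for tok in lowered if tok in lexicon]


def bag_of_sets(tokens, lexicon):
    """Count set identifiers, the unit a topic model would then operate on."""
    return Counter(to_cross_lingual_sets(tokens, lexicon))


# A code-mixed example sentence (assumed for illustration only).
doc = "Khana accha tha but the phone service was not good".split()
counts = bag_of_sets(doc, TOY_LEXICON)
```

In this sketch, "accha" and "good" contribute to the same count, so a topic model trained on such counts would treat them as one unit regardless of language.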