Discovering thematic knowledge from code-mixed chat messages using topic model

Asnani, K.; Pawar, J.D.

Source : Proc. 3. Workshop on Indian Language Data: Resources and Evaluation (WILDRE3), Portoroz, Slovenia. 24 May 2016. 2016; 104-109.

URI: http://irgu.unigoa.ac.in/drs/handle/unigoa/4461

Date: 2016

Document Type: Conference article

Abstract:

Mixing two or more languages in a single communication (code-mixing) is now very common on social media. Such code-mixed chat data is generated in enormous volumes and is usually noisy and sparse, with the useful topics that people discuss highly dispersed across it. This makes it very challenging to automatically extract relevant thematic information that contributes to useful knowledge. In the existing literature, a standard topic model, Probabilistic Latent Semantic Analysis (PLSA), is used to discover latent themes from multilingual data. However, it addresses inter-sentence multilingualism, whereas in code-mixed chat text inter-sentence, intra-sentence and intra-word code-mixing may all occur at random. In this paper, we propose a novel method based on co-occurrences of words within a code-mixed message. The co-occurrence matrix built from the chat in this way is given as input to PLSA, which is used to discover thematic knowledge from it. Extensive experiments show that this strategy can discover latent themes from semantic topic clusters. We tested our system on the FIRE 2014 dataset.
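
The pipeline the abstract describes (count word co-occurrences within each message, then fit PLSA to the resulting matrix) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the whitespace tokenisation, the symmetric aspect-model form of PLSA, the function names cooccurrence_matrix and plsa, and the sample messages are all assumptions made for illustration, and the messages are invented rather than taken from FIRE 2014.

```python
from itertools import combinations

import numpy as np


def cooccurrence_matrix(messages):
    """Count how often two distinct words appear in the same message.

    Assumption: lower-cased whitespace tokens stand in for real
    code-mixed tokenisation, which the abstract does not specify.
    """
    vocab = sorted({w for m in messages for w in m.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for m in messages:
        words = sorted(set(m.lower().split()))
        for a, b in combinations(words, 2):
            counts[index[a], index[b]] += 1
            counts[index[b], index[a]] += 1  # keep the matrix symmetric
    return vocab, counts


def plsa(counts, n_topics, n_iter=50, seed=0):
    """Fit P(w1, w2) = sum_z P(z) P(w1|z) P(w2|z) with plain EM."""
    eps = 1e-12
    rng = np.random.default_rng(seed)
    n_rows, n_cols = counts.shape
    p_z = np.full(n_topics, 1.0 / n_topics)
    p_w1 = rng.random((n_topics, n_rows))
    p_w1 /= p_w1.sum(axis=1, keepdims=True)
    p_w2 = rng.random((n_topics, n_cols))
    p_w2 /= p_w2.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z | w1, w2), shape (topics, rows, cols)
        joint = p_z[:, None, None] * p_w1[:, :, None] * p_w2[:, None, :]
        post = joint / (joint.sum(axis=0, keepdims=True) + eps)
        # M-step: re-estimate parameters from expected counts
        weighted = counts[None, :, :] * post
        p_w1 = weighted.sum(axis=2)
        p_w2 = weighted.sum(axis=1)
        p_z = weighted.sum(axis=(1, 2))
        p_w1 /= p_w1.sum(axis=1, keepdims=True) + eps
        p_w2 /= p_w2.sum(axis=1, keepdims=True) + eps
        p_z /= p_z.sum() + eps
    return p_z, p_w1, p_w2


# Invented Hindi-English style messages, for illustration only.
messages = [
    "kal ka match dekha kya",
    "match mast tha yaar",
    "exam kab hai",
    "exam ki date batao",
]
vocab, counts = cooccurrence_matrix(messages)
p_z, p_w1, p_w2 = plsa(counts, n_topics=2)
for k in range(2):
    top = np.argsort(p_w1[k])[::-1][:3]
    print(f"topic {k}:", [vocab[i] for i in top])
```

Working from within-message co-occurrences means the model never needs a language label for each word, which is the property that matters when languages are mixed inside a sentence or a word. The dense (topics, rows, columns) posterior array is only practical for a small vocabulary; a larger one would need a sparse formulation.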