Abstract:
Social networking platforms such as Twitter and Facebook, along with blogs, are easy to access and highly popular. The growth of such rich social media content has generated petabytes of data on the web. This content has renewed research interest because the use of multiple languages in routine communication is rapidly gaining popularity. Such large repositories of multilingual chat content are usually noisy and are represented in highly sparse structures. This situation is driving increasing interest in automatically extracting and clustering aspects from multilingual data. We propose a novel method based on a probabilistic topic model for identifying and extracting aspects, both explicit and implicit, and for clustering those aspects in multilingual blog data. Words from multiple languages may occur randomly within and across blog messages. Our experiments show that this strategy can discover aspect clusters comprising semantically implicit themes. We tested our system on the FIRE 2014 dataset.