Detecting Representative Web Articles Using Heterogeneous Graphs
With the rapid growth of on-line news media, guarding against malicious news articles is becoming an essential requirement for on-line news service providers. Near-duplicate articles are one of the most common types of malicious news articles. However, previous research has concentrated on improving the effectiveness and accuracy of finding near-duplicate article pairs or clusters, and much less on deciding which of the duplicates should be deleted and which retained for service. This is an important problem for news service providers. Previous techniques for representative selection can be used on normal articles but show many disadvantages when dealing with news articles. In this paper, we propose a novel heterogeneous-graph-based representative news article selection algorithm, named HGRS, for finding the most valuable news article in a given set of near-duplicate news articles. The proposed algorithm has been evaluated on a real-world dataset, and the experimental results show that HGRS can select the representative news article effectively.