Using maximal spanning trees and word similarity to generate hierarchical clusters of non-redundant RSS news articles

Maria Soledad Pera, Yiu Kai Dennis Ng*

*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review

3 Citations (Scopus)

Abstract

RSS news articles that partially or completely duplicate one another in content are now common on the Internet, which requires Web users to sort through them to identify non-redundant information. This manual-filtering process is time-consuming and tedious. In this paper, we present a new filtering and clustering approach, called FICUS, which first identifies and eliminates redundant RSS news articles using a fuzzy set information retrieval approach and then clusters the remaining non-redundant RSS news articles according to their degrees of resemblance. FICUS uses a tree hierarchy to organize clusters of RSS news articles. The contents of the respective clusters are captured by representative keywords from the RSS news articles in the clusters, so that searching for and retrieving similar RSS news articles are fast and efficient. FICUS is simple, since it uses pre-defined word-correlation factors to determine related (words in) RSS news articles and filter redundant ones, and is supported by well-known yet simple mathematical models, such as the standard deviation, vector space model, and probability theory, to generate clusters of non-redundant RSS news articles. Experiments performed on (test sets of) RSS news articles on various topics, which were downloaded from different online sources, verify the accuracy of FICUS in eliminating redundant RSS news articles, clustering similar RSS news articles together, and segregating different RSS news articles in terms of their contents. In addition, further empirical studies show that FICUS outperforms well-known approaches adopted for clustering RSS news articles.

Original language: English
Pages (from-to): 513-534
Number of pages: 22
Journal: Journal of Intelligent Information Systems
Volume: 39
Issue number: 2
Publication status: Published - 2012
Externally published: Yes

Keywords

  • Fuzzy set IR model
  • RSS news
  • Similarity measures
  • Text clustering
