TY - GEN
T1 - Identifying spam web pages based on content similarity
AU - Pera, Maria Soledad
AU - Ng, Yiu Kai
PY - 2008
Y1 - 2008
N2 - The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines are faced with an annoying problem: the presence of misleading Web pages, i.e., spam Web pages, that are ranked among legitimate Web pages. The mixed results downgrade the performance of search engines and frustrate users who are required to filter out useless information. In order to improve the quality of Web searches, the number of spam pages on the Web must be reduced, if they cannot be eradicated entirely. In this paper, we present a novel approach for identifying spam Web pages that have mismatched titles and bodies and/or low percentage of hidden content. By considering the content of Web pages, we develop a spam-detection tool that is (i) reliable, since we can accurately detect 94% of spam/legitimate Web pages, and (ii) computational inexpensive, since the word-correlation factors used for content analysis are precomputed. We have verified that our spam-detection approach outperforms existing anti-spam methods by an average of 10% in terms of F-measure.
AB - The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines are faced with an annoying problem: the presence of misleading Web pages, i.e., spam Web pages, that are ranked among legitimate Web pages. The mixed results downgrade the performance of search engines and frustrate users who are required to filter out useless information. In order to improve the quality of Web searches, the number of spam pages on the Web must be reduced, if they cannot be eradicated entirely. In this paper, we present a novel approach for identifying spam Web pages that have mismatched titles and bodies and/or low percentage of hidden content. By considering the content of Web pages, we develop a spam-detection tool that is (i) reliable, since we can accurately detect 94% of spam/legitimate Web pages, and (ii) computational inexpensive, since the word-correlation factors used for content analysis are precomputed. We have verified that our spam-detection approach outperforms existing anti-spam methods by an average of 10% in terms of F-measure.
UR - http://www.scopus.com/inward/record.url?scp=54249111606&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-69848-7_18
DO - 10.1007/978-3-540-69848-7_18
M3 - Conference contribution
AN - SCOPUS:54249111606
SN - 354069840X
SN - 9783540698401
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 204
EP - 219
BT - Computational Science and Its Applications - ICCSA 2008 - International Conference, Proceedings
T2 - International Conference on Computational Science and Its Applications, ICCSA 2008
Y2 - 30 June 2008 through 3 July 2008
ER -