Identifying spam web pages based on content similarity

Maria Soledad Pera, Yiu Kai Ng

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

8 Citations (Scopus)

Abstract

The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines face an annoying problem: misleading Web pages, i.e., spam Web pages, are ranked among legitimate Web pages. The mixed results degrade the performance of search engines and frustrate users, who must filter out useless information. To improve the quality of Web searches, the number of spam pages on the Web must be reduced, if they cannot be eradicated entirely. In this paper, we present a novel approach for identifying spam Web pages that have mismatched titles and bodies and/or a low percentage of hidden content. By considering the content of Web pages, we develop a spam-detection tool that is (i) reliable, since it accurately detects 94% of spam/legitimate Web pages, and (ii) computationally inexpensive, since the word-correlation factors used for content analysis are precomputed. We have verified that our spam-detection approach outperforms existing anti-spam methods by an average of 10% in terms of F-measure.
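
As a rough illustration of the title/body mismatch idea in the abstract, the sketch below scores how well a page title is supported by its body using a table of precomputed word-correlation factors. It is only a sketch under stated assumptions: the abstract does not give the authors' scoring function, so the table `CORRELATION`, its toy values, the averaging rule in `title_body_similarity`, and the `threshold` default are all hypothetical and are not taken from the paper.

```python
from typing import Dict, FrozenSet, List

# Hypothetical precomputed word-correlation factors (toy values, not from the paper).
# Keys are unordered word pairs, so the lookup is symmetric.
CORRELATION: Dict[FrozenSet[str], float] = {
    frozenset({"cheap", "discount"}): 0.8,
    frozenset({"flight", "airline"}): 0.7,
    frozenset({"flight", "ticket"}): 0.6,
}


def correlation(a: str, b: str) -> float:
    """Return the stored factor for a word pair.

    Assumption: identical words score 1.0 and unknown pairs score 0.0.
    """
    if a == b:
        return 1.0
    return CORRELATION.get(frozenset({a, b}), 0.0)


def title_body_similarity(title: List[str], body: List[str]) -> float:
    """Average, over title words, of the best correlation with any body word."""
    if not title or not body:
        return 0.0
    best_per_title_word = (max(correlation(t, w) for w in body) for t in title)
    return sum(best_per_title_word) / len(title)


def looks_mismatched(title: List[str], body: List[str], threshold: float = 0.5) -> bool:
    """Flag a page whose title is poorly supported by its body (assumed threshold)."""
    return title_body_similarity(title, body) < threshold
```

With these toy values, a page titled "cheap flight" whose body mentions "discount", "airline" and "ticket" is not flagged, while the same title over a body about "casino" and "poker" is. Because the factors are looked up rather than computed per page, the cost per page is a handful of dictionary reads, which is the sense in which precomputation keeps the analysis inexpensive.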

Original language: English
Title of host publication: Computational Science and Its Applications - ICCSA 2008 - International Conference, Proceedings
Pages: 204-219
Number of pages: 16
Edition: PART 2
Publication status: Published - 2008
Externally published: Yes
Event: International Conference on Computational Science and Its Applications, ICCSA 2008 - Perugia, Italy
Duration: 30 Jun 2008 to 3 Jul 2008

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 2
Volume: 5073 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: International Conference on Computational Science and Its Applications, ICCSA 2008
Country/Territory: Italy
City: Perugia
Period: 30/06/08 to 3/07/08
