Identifying spam web pages based on content similarity

Maria Soledad Pera; Yiu Kai Ng

doi:10.1007/978-3-540-69848-7_18

Research output: Chapter in Book/Conference proceedings/Edited volume › Conference contribution › Scientific › peer-review

8 Citations (Scopus)

Abstract

The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines are faced with an annoying problem: the presence of misleading Web pages, i.e., spam Web pages, that are ranked among legitimate Web pages. The mixed results downgrade the performance of search engines and frustrate users who are required to filter out useless information. In order to improve the quality of Web searches, the number of spam pages on the Web must be reduced, if they cannot be eradicated entirely. In this paper, we present a novel approach for identifying spam Web pages that have mismatched titles and bodies and/or a low percentage of hidden content. By considering the content of Web pages, we develop a spam-detection tool that is (i) reliable, since we can accurately detect 94% of spam/legitimate Web pages, and (ii) computationally inexpensive, since the word-correlation factors used for content analysis are precomputed. We have verified that our spam-detection approach outperforms existing anti-spam methods by an average of 10% in terms of F-measure.
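The abstract mentions comparing a page's title with its body using precomputed word-correlation factors, but this record does not give the formula. The Python sketch below is only an illustration of that general idea, not the authors' method: the `CORRELATION` table and its values, the function names, the best-match averaging scheme, and the 0.5 threshold are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, Iterable

# Hypothetical precomputed table: unordered word pair -> correlation in [0, 1].
# Real word-correlation factors would be derived from a large corpus offline.
CORRELATION: Dict[FrozenSet[str], float] = {
    frozenset({"cheap", "discount"}): 0.8,
    frozenset({"flight", "airline"}): 0.9,
    frozenset({"flight", "travel"}): 0.7,
}


def correlation(a: str, b: str) -> float:
    """Look up the correlation of two words; identical words score 1.0."""
    if a == b:
        return 1.0
    return CORRELATION.get(frozenset((a, b)), 0.0)


def title_body_similarity(title_words: Iterable[str],
                          body_words: Iterable[str]) -> float:
    """Average, over title words, of the best correlation with any body word."""
    title = list(title_words)
    body = list(body_words)
    if not title or not body:
        return 0.0
    best = [max(correlation(t, w) for w in body) for t in title]
    return sum(best) / len(best)


def looks_mismatched(title: str, body: str, threshold: float = 0.5) -> bool:
    """Flag a page whose title and body score below an assumed threshold."""
    score = title_body_similarity(title.lower().split(), body.lower().split())
    return score < threshold
```

Because the table is built ahead of time, scoring a page reduces to dictionary lookups, which is consistent with the abstract's point that precomputation keeps the analysis inexpensive.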

Original language	English
Title of host publication	Computational Science and Its Applications - ICCSA 2008 - International Conference, Proceedings
Pages	204-219
Number of pages	16
Edition	PART 2
DOIs	https://doi.org/10.1007/978-3-540-69848-7_18
Publication status	Published - 2008
Externally published	Yes
Event	International Conference on Computational Science and Its Applications, ICCSA 2008 - Perugia, Italy
Duration	30 Jun 2008 → 3 Jul 2008

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number	PART 2
Volume	5073 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	International Conference on Computational Science and Its Applications, ICCSA 2008
Country/Territory	Italy
City	Perugia
Period	30/06/08 → 3/07/08

Access to Document

10.1007/978-3-540-69848-7_18

Cite this

Pera, M. S., & Ng, Y. K. (2008). Identifying spam web pages based on content similarity. In Computational Science and Its Applications - ICCSA 2008 - International Conference, Proceedings (PART 2 ed., pp. 204-219). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5073 LNCS, No. PART 2). https://doi.org/10.1007/978-3-540-69848-7_18

@inproceedings{d14358e733d24a6bbd038032d59084fd,

title = "Identifying spam web pages based on content similarity",

abstract = "The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines are faced with an annoying problem: the presence of misleading Web pages, i.e., spam Web pages, that are ranked among legitimate Web pages. The mixed results downgrade the performance of search engines and frustrate users who are required to filter out useless information. In order to improve the quality of Web searches, the number of spam pages on the Web must be reduced, if they cannot be eradicated entirely. In this paper, we present a novel approach for identifying spam Web pages that have mismatched titles and bodies and/or a low percentage of hidden content. By considering the content of Web pages, we develop a spam-detection tool that is (i) reliable, since we can accurately detect 94% of spam/legitimate Web pages, and (ii) computationally inexpensive, since the word-correlation factors used for content analysis are precomputed. We have verified that our spam-detection approach outperforms existing anti-spam methods by an average of 10% in terms of F-measure.",

author = "Pera, {Maria Soledad} and Ng, {Yiu Kai}",

year = "2008",

doi = "10.1007/978-3-540-69848-7_18",

language = "English",

isbn = "354069840X",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

number = "PART 2",

pages = "204--219",

booktitle = "Computational Science and Its Applications - ICCSA 2008 - International Conference, Proceedings",

edition = "PART 2",

note = "International Conference on Computational Science and Its Applications, ICCSA 2008 ; Conference date: 30-06-2008 Through 03-07-2008",

}

Pera, MS & Ng, YK 2008, Identifying spam web pages based on content similarity. in Computational Science and Its Applications - ICCSA 2008 - International Conference, Proceedings. PART 2 edn, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), no. PART 2, vol. 5073 LNCS, pp. 204-219, International Conference on Computational Science and Its Applications, ICCSA 2008, Perugia, Italy, 30/06/08. https://doi.org/10.1007/978-3-540-69848-7_18

Identifying spam web pages based on content similarity. / Pera, Maria Soledad; Ng, Yiu Kai.
Computational Science and Its Applications - ICCSA 2008 - International Conference, Proceedings. PART 2. ed. 2008. p. 204-219 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5073 LNCS, No. PART 2).

TY - GEN

T1 - Identifying spam web pages based on content similarity

AU - Pera, Maria Soledad

AU - Ng, Yiu Kai

PY - 2008

Y1 - 2008

N2 - The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines are faced with an annoying problem: the presence of misleading Web pages, i.e., spam Web pages, that are ranked among legitimate Web pages. The mixed results downgrade the performance of search engines and frustrate users who are required to filter out useless information. In order to improve the quality of Web searches, the number of spam pages on the Web must be reduced, if they cannot be eradicated entirely. In this paper, we present a novel approach for identifying spam Web pages that have mismatched titles and bodies and/or a low percentage of hidden content. By considering the content of Web pages, we develop a spam-detection tool that is (i) reliable, since we can accurately detect 94% of spam/legitimate Web pages, and (ii) computationally inexpensive, since the word-correlation factors used for content analysis are precomputed. We have verified that our spam-detection approach outperforms existing anti-spam methods by an average of 10% in terms of F-measure.

AB - The Web provides its users with abundant information. Unfortunately, when a Web search is performed, both users and search engines are faced with an annoying problem: the presence of misleading Web pages, i.e., spam Web pages, that are ranked among legitimate Web pages. The mixed results downgrade the performance of search engines and frustrate users who are required to filter out useless information. In order to improve the quality of Web searches, the number of spam pages on the Web must be reduced, if they cannot be eradicated entirely. In this paper, we present a novel approach for identifying spam Web pages that have mismatched titles and bodies and/or a low percentage of hidden content. By considering the content of Web pages, we develop a spam-detection tool that is (i) reliable, since we can accurately detect 94% of spam/legitimate Web pages, and (ii) computationally inexpensive, since the word-correlation factors used for content analysis are precomputed. We have verified that our spam-detection approach outperforms existing anti-spam methods by an average of 10% in terms of F-measure.

UR - http://www.scopus.com/inward/record.url?scp=54249111606&partnerID=8YFLogxK

U2 - 10.1007/978-3-540-69848-7_18

DO - 10.1007/978-3-540-69848-7_18

M3 - Conference contribution

AN - SCOPUS:54249111606

SN - 354069840X

SN - 9783540698401

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 204

EP - 219

BT - Computational Science and Its Applications - ICCSA 2008 - International Conference, Proceedings

T2 - International Conference on Computational Science and Its Applications, ICCSA 2008

Y2 - 30 June 2008 through 3 July 2008

ER -

Pera MS, Ng YK. Identifying spam web pages based on content similarity. In Computational Science and Its Applications - ICCSA 2008 - International Conference, Proceedings. PART 2 ed. 2008. p. 204-219. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); PART 2). doi: 10.1007/978-3-540-69848-7_18
