A Naïve Bayes classifier for web document summaries created by using word similarity and significant factors

Maria Soledad Pera; Yiu Kai Ng

doi:10.1142/S0218213010000285

Maria Soledad Pera, Yiu Kai Ng^*

^*Corresponding author for this work

Research output: Contribution to journal › Article › Scientific › peer-review

13 Citations (Scopus)

Abstract

Text classification categorizes web documents in large collections into predefined classes based on their contents. Unfortunately, the classification process can be time-consuming, and users are still required to spend a considerable amount of time scanning through the classified web documents to identify the ones whose contents satisfy their information needs. To solve this problem, we first introduce CorSum, an extractive single-document summarization approach that is simple and effective, since it relies only on word similarity to generate high-quality summaries. We further enhance CorSum by considering the significance factor of sentences in documents, in addition to word-correlation factors, for document summarization. We denote the enhanced approach CorSum-SF and use the summaries generated by CorSum-SF to train a Multinomial Naïve Bayes classifier for categorizing web document summaries into predefined classes. Experimental results on the DUC-2002 and 20 Newsgroups datasets show that CorSum-SF outperforms other extractive summarization methods, and that using CorSum-SF generated summaries instead of entire documents significantly reduces classification time while keeping accuracy comparable. More importantly, browsing summaries assigned to predefined categories, instead of entire documents, facilitates the information search process on the Web.
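
The classification step mentioned in the abstract can be illustrated with a minimal sketch of a multinomial Naïve Bayes classifier trained on short texts. This is not the authors' implementation: the function names (`train_mnb`, `classify`), the Laplace smoothing constant `alpha`, and the toy "summaries" and labels are all assumptions made for illustration, and the CorSum-SF summarization step itself is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenized texts.

    docs:   list of token lists (hypothetical stand-ins for summaries).
    labels: list of class names, one per entry in docs.
    alpha:  Laplace smoothing constant (an assumption of this sketch).
    """
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)

    model = {"vocab": vocab, "alpha": alpha, "classes": {}}
    for label, n_docs in class_docs.items():
        total_words = sum(word_counts[label].values())
        model["classes"][label] = {
            # Class prior estimated from document frequencies.
            "log_prior": math.log(n_docs / len(docs)),
            "counts": word_counts[label],
            # Denominator of the smoothed word likelihood.
            "denom": total_words + alpha * len(vocab),
        }
    return model


def classify(model, tokens):
    """Return the class with the highest log-posterior score."""
    best_label, best_score = None, -math.inf
    for label, stats in model["classes"].items():
        score = stats["log_prior"]
        for tok in tokens:
            if tok not in model["vocab"]:
                continue  # ignore words never seen during training
            smoothed = stats["counts"][tok] + model["alpha"]
            score += math.log(smoothed / stats["denom"])
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented purely for illustration.
summaries = [
    "the team won the hockey game".split(),
    "the pitcher threw a fast ball".split(),
    "the new graphics card renders images quickly".split(),
    "install the driver for the video card".split(),
]
labels = ["sport", "sport", "computing", "computing"]

model = train_mnb(summaries, labels)
predicted = classify(model, "hockey game tonight".split())
```

Because the model only stores per-class word counts, training and scoring on short summaries touches far fewer tokens than on full documents, which is consistent with the reduction in classification time that the abstract reports.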

Original language	English
Pages (from-to)	465-486
Number of pages	22
Journal	International Journal on Artificial Intelligence Tools
Volume	19
Issue number	4
DOIs	https://doi.org/10.1142/S0218213010000285
Publication status	Published - 2010
Externally published	Yes

Keywords

Multinomial Naïve Bayes classifier
sentence-based summaries
significant factors
word correlation

Cite this

@article{143646c43d714f95aa617ec1b4c49ab7,

title = "A Na{\"i}ve {B}ayes classifier for web document summaries created by using word similarity and significant factors",

abstract = "Text classification categorizes web documents in large collections into predefined classes based on their contents. Unfortunately, the classification process can be time-consuming and users are still required to spend considerable amount of time scanning through the classified web documents to identify the ones with contents that satisfy their information needs. In solving this problem, we first introduce CorSum, an extractive single-document summarization approach, which is simple and effective in performing the summarization task, since it only relies on word similarity to generate high-quality summaries. We further enhance CorSum by considering the significance factor of sentences in documents, in addition to using word-correlation factors, for document summarization. We denote the enhanced approach CorSum-SF and use the summaries generated by CorSum-SF to train a Multinomial Na{\"i}ve Bayes classifier for categorizing web document summaries into predefined classes. Experimental results on the DUC-2002 and 20 Newsgroups datasets show that CorSum-SF outperforms other extractive summarization methods, and classification time (accuracy, respectively) is significantly reduced (compatible, respectively) using CorSum-SF generated summaries compared with using the entire documents. More importantly, browsing summaries, instead of entire documents, which are assigned to predefined categories, facilitates the information search process on the Web.",

keywords = "Multinomial Na{\"i}ve Bayes classifier, sentence-based summaries, significant factors, word correlation",

author = "Pera, {Maria Soledad} and Ng, {Yiu Kai}",

year = "2010",

doi = "10.1142/S0218213010000285",

language = "English",

volume = "19",

pages = "465--486",

journal = "International Journal on Artificial Intelligence Tools",

issn = "0218-2130",

publisher = "World Scientific Publishing",

number = "4",

}

TY - JOUR

T1 - A Naïve Bayes classifier for web document summaries created by using word similarity and significant factors

AU - Pera, Maria Soledad

AU - Ng, Yiu Kai

PY - 2010

Y1 - 2010

N2 - Text classification categorizes web documents in large collections into predefined classes based on their contents. Unfortunately, the classification process can be time-consuming and users are still required to spend considerable amount of time scanning through the classified web documents to identify the ones with contents that satisfy their information needs. In solving this problem, we first introduce CorSum, an extractive single-document summarization approach, which is simple and effective in performing the summarization task, since it only relies on word similarity to generate high-quality summaries. We further enhance CorSum by considering the significance factor of sentences in documents, in addition to using word-correlation factors, for document summarization. We denote the enhanced approach CorSum-SF and use the summaries generated by CorSum-SF to train a Multinomial Naïve Bayes classifier for categorizing web document summaries into predefined classes. Experimental results on the DUC-2002 and 20 Newsgroups datasets show that CorSum-SF outperforms other extractive summarization methods, and classification time (accuracy, respectively) is significantly reduced (compatible, respectively) using CorSum-SF generated summaries compared with using the entire documents. More importantly, browsing summaries, instead of entire documents, which are assigned to predefined categories, facilitates the information search process on the Web.

AB - Text classification categorizes web documents in large collections into predefined classes based on their contents. Unfortunately, the classification process can be time-consuming and users are still required to spend considerable amount of time scanning through the classified web documents to identify the ones with contents that satisfy their information needs. In solving this problem, we first introduce CorSum, an extractive single-document summarization approach, which is simple and effective in performing the summarization task, since it only relies on word similarity to generate high-quality summaries. We further enhance CorSum by considering the significance factor of sentences in documents, in addition to using word-correlation factors, for document summarization. We denote the enhanced approach CorSum-SF and use the summaries generated by CorSum-SF to train a Multinomial Naïve Bayes classifier for categorizing web document summaries into predefined classes. Experimental results on the DUC-2002 and 20 Newsgroups datasets show that CorSum-SF outperforms other extractive summarization methods, and classification time (accuracy, respectively) is significantly reduced (compatible, respectively) using CorSum-SF generated summaries compared with using the entire documents. More importantly, browsing summaries, instead of entire documents, which are assigned to predefined categories, facilitates the information search process on the Web.

KW - Multinomial Naïve Bayes classifier

KW - sentence-based summaries

KW - significant factors

KW - word correlation

UR - http://www.scopus.com/inward/record.url?scp=77956029713&partnerID=8YFLogxK

U2 - 10.1142/S0218213010000285

DO - 10.1142/S0218213010000285

M3 - Article

AN - SCOPUS:77956029713

SN - 0218-2130

VL - 19

SP - 465

EP - 486

JO - International Journal on Artificial Intelligence Tools

JF - International Journal on Artificial Intelligence Tools

IS - 4

ER -