Con-Text: Text Detection for Fine-Grained Object Classification

Sezer Karaoğlu; Ran Tao; Jan van Gemert; Theo Gevers

doi:10.1109/TIP.2017.2707805

Pattern Recognition and Bioinformatics

Research output: Contribution to journal › Article › Scientific › peer-review

30 Citations (Scopus)

Abstract

This paper focuses on fine-grained object classification using recognized scene text in natural images. While the state of the art relies on visual cues only, this paper is the first work to propose combining textual and visual cues. Another novelty is the textual cue extraction. Unlike state-of-the-art text detection methods, we focus more on the background than on text regions. Once text regions are detected, they are further processed by two methods to perform text recognition, i.e., the ABBYY commercial OCR engine and a state-of-the-art character recognition algorithm. Then, to perform textual cue encoding, bi- and trigrams are formed between the recognized characters by considering the proposed spatial pairwise constraints. Finally, the extracted visual and textual cues are combined for fine-grained classification. The proposed method is validated on four publicly available data sets: ICDAR03, ICDAR13, Con-Text, and Flickr-logo. We improve the state-of-the-art end-to-end character recognition by a large margin of 15% on ICDAR03. We show that textual cues are useful in addition to visual cues for fine-grained classification. We show that textual cues are also useful for logo retrieval. Adding textual cues outperforms the visual-only and textual-only settings in fine-grained classification (70.7% versus 60.3%) and logo retrieval (57.4% versus 54.8%).
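The textual cue encoding step mentioned in the abstract (bigrams formed between recognized characters under spatial pairwise constraints) can be illustrated with a minimal sketch. The abstract does not state what the constraints are, so the Euclidean distance threshold used here is an assumption for illustration only, and the function name `spatial_bigrams`, the `max_dist` parameter, and the `(char, x, y)` input layout are all hypothetical rather than taken from the paper.

```python
from collections import Counter
from itertools import combinations
from math import hypot


def spatial_bigrams(chars, max_dist):
    """Count character bigrams among spatially close detections.

    chars: list of (character, x, y) tuples, where (x, y) is the center
           of a recognized character (hypothetical input layout).
    max_dist: assumed spatial constraint; a pair forms a bigram only if
              its centers are at most this far apart.
    """
    counts = Counter()
    for (c1, x1, y1), (c2, x2, y2) in combinations(chars, 2):
        if hypot(x2 - x1, y2 - y1) <= max_dist:
            # Order each pair left to right so the bigram follows
            # the horizontal reading direction.
            first, second = (c1, c2) if x1 <= x2 else (c2, c1)
            counts[first + second] += 1
    return counts


# Four characters on one text line, spaced 10 units apart.
detections = [("S", 0, 0), ("H", 10, 0), ("O", 20, 0), ("P", 30, 0)]
bigrams = spatial_bigrams(detections, max_dist=15)
```

With a threshold of 15, only adjacent characters pair up, so the resulting histogram holds the bigrams SH, HO, and OP once each. A histogram like this could serve as a textual feature vector to combine with visual features, though the fusion scheme used in the paper is not described in the abstract.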

Original language	English
Article number	7933250
Pages (from-to)	3965-3980
Number of pages	16
Journal	IEEE Transactions on Image Processing
Volume	26
Issue number	8
DOIs	https://doi.org/10.1109/TIP.2017.2707805
Publication status	Published - 2017

Keywords

fine-grained classification
logo-retrieval
Multimodal fusion
text detection
text saliency

Cite this

@article{fb273f48bc4a4a27834c18c8f21159a8,

title = "Con-Text: Text Detection for Fine-Grained Object Classification",

abstract = "This paper focuses on fine-grained object classification using recognized scene text in natural images. While the state-of-the-art relies on visual cues only, this paper is the first work which proposes to combine textual and visual cues. Another novelty is the textual cue extraction. Unlike the state-of-the-art text detection methods, we focus more on the background instead of text regions. Once text regions are detected, they are further processed by two methods to perform text recognition, i.e., ABBYY commercial OCR engine and a state-of-the-art character recognition algorithm. Then, to perform textual cue encoding, bi- and trigrams are formed between the recognized characters by considering the proposed spatial pairwise constraints. Finally, extracted visual and textual cues are combined for fine-grained classification. The proposed method is validated on four publicly available data sets: ICDAR03, ICDAR13, Con-Text, and Flickr-logo. We improve the state-of-the-art end-to-end character recognition by a large margin of 15% on ICDAR03. We show that textual cues are useful in addition to visual cues for fine-grained classification. We show that textual cues are also useful for logo retrieval. Adding textual cues outperforms visual- and textual-only in fine-grained classification (70.7% to 60.3%) and logo retrieval (57.4% to 54.8%).",

keywords = "fine-grained classification, logo-retrieval, Multimodal fusion, text detection, text saliency",

author = "Karaoğlu, Sezer and Tao, Ran and {van Gemert}, Jan and Gevers, Theo",

year = "2017",

doi = "10.1109/TIP.2017.2707805",

language = "English",

volume = "26",

pages = "3965--3980",

journal = "IEEE Transactions on Image Processing",

issn = "1057-7149",

publisher = "Institute of Electrical and Electronics Engineers (IEEE)",

number = "8",

}

TY - JOUR

T1 - Con-Text

T2 - Text Detection for Fine-Grained Object Classification

AU - Karaoğlu, Sezer

AU - Tao, Ran

AU - van Gemert, Jan

AU - Gevers, Theo

PY - 2017

Y1 - 2017

N2 - This paper focuses on fine-grained object classification using recognized scene text in natural images. While the state-of-the-art relies on visual cues only, this paper is the first work which proposes to combine textual and visual cues. Another novelty is the textual cue extraction. Unlike the state-of-the-art text detection methods, we focus more on the background instead of text regions. Once text regions are detected, they are further processed by two methods to perform text recognition, i.e., ABBYY commercial OCR engine and a state-of-the-art character recognition algorithm. Then, to perform textual cue encoding, bi- and trigrams are formed between the recognized characters by considering the proposed spatial pairwise constraints. Finally, extracted visual and textual cues are combined for fine-grained classification. The proposed method is validated on four publicly available data sets: ICDAR03, ICDAR13, Con-Text, and Flickr-logo. We improve the state-of-the-art end-to-end character recognition by a large margin of 15% on ICDAR03. We show that textual cues are useful in addition to visual cues for fine-grained classification. We show that textual cues are also useful for logo retrieval. Adding textual cues outperforms visual- and textual-only in fine-grained classification (70.7% to 60.3%) and logo retrieval (57.4% to 54.8%).

AB - This paper focuses on fine-grained object classification using recognized scene text in natural images. While the state-of-the-art relies on visual cues only, this paper is the first work which proposes to combine textual and visual cues. Another novelty is the textual cue extraction. Unlike the state-of-the-art text detection methods, we focus more on the background instead of text regions. Once text regions are detected, they are further processed by two methods to perform text recognition, i.e., ABBYY commercial OCR engine and a state-of-the-art character recognition algorithm. Then, to perform textual cue encoding, bi- and trigrams are formed between the recognized characters by considering the proposed spatial pairwise constraints. Finally, extracted visual and textual cues are combined for fine-grained classification. The proposed method is validated on four publicly available data sets: ICDAR03, ICDAR13, Con-Text, and Flickr-logo. We improve the state-of-the-art end-to-end character recognition by a large margin of 15% on ICDAR03. We show that textual cues are useful in addition to visual cues for fine-grained classification. We show that textual cues are also useful for logo retrieval. Adding textual cues outperforms visual- and textual-only in fine-grained classification (70.7% to 60.3%) and logo retrieval (57.4% to 54.8%).

KW - fine-grained classification

KW - logo-retrieval

KW - Multimodal fusion

KW - text detection

KW - text saliency

UR - http://www.scopus.com/inward/record.url?scp=85028468300&partnerID=8YFLogxK

U2 - 10.1109/TIP.2017.2707805

DO - 10.1109/TIP.2017.2707805

M3 - Article

AN - SCOPUS:85028468300

SN - 1057-7149

VL - 26

SP - 3965

EP - 3980

JO - IEEE Transactions on Image Processing

JF - IEEE Transactions on Image Processing

IS - 8

M1 - 7933250

ER -
