Text Recognition in Images: A Survey of the Deep Learning Era

Srilalitha Veerubhotla
May 12, 2020

Scene text carries rich semantic information and has attracted growing attention in recent years. Text identification has enabled many applications in both computer vision and natural language processing. Recent techniques address long-standing problems and open up new applications. In this article I survey the approaches and state-of-the-art methods developed for text recognition systems over the past couple of years. It covers the following:

  1. The work at a high level
  2. Previous techniques
  3. Recent advances
  4. Popular datasets available
  5. Future areas

Linguistic Text Identification in Image

Overview of the work:

Techniques used to detect and recognize text in scene images and videos fall into three categories:

Text Detection and Localization: This step determines whether text is present in a given scene image or video and, if it is, identifies where the text is located.

Challenges: Varied orientations, languages, colors and sizes; complex backgrounds; occlusion; blur; noise; non-uniform illumination.

Text Recognition: Text recognition aims at converting the localized text in an image into character codes.

Challenges:

  1. Scene complexity: Images and videos generally suffer from noise, distortion, non-uniform illumination and partial occlusion, and text can be hard to tell apart from the background. Complex backgrounds make text detection and recognition harder in the real world.
  2. Text diversity and more stringent practical requirements: Scene text varies in color, size, orientation, font and language, and parts of the text may be missing.

End-to-End Text Recognition: This combines text detection and localization with recognition in a single system.

Classical Methods and Their Methodologies

  • Photo OCR system for text extraction
  1. Text detection
  2. Segmentation (Niblack binarization, a morphological approach; binary sliding-window classifier)
  3. Beam search
  4. Character classifier (a fully connected network with ReLU units)
  5. Language modeling (a standard n-gram model)
  • Real-time lexicon-free scene text localization and recognition
  1. Character detection: extremal regions; incrementally computable descriptors (area, bounding box, perimeter); sequential classifier (first a Real AdaBoost classifier with decision trees, then an SVM with an RBF kernel)
  2. Text line formation
  3. Character recognition
  4. Sequence selection
  • Recursive recurrent nets with attention modeling for OCR in the wild
  1. Character sequence model (encodes image features using recursive convolutional layers and decodes text using RNNs)
  2. RNN with an attention function
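
To make the segmentation step of the Photo OCR pipeline above more concrete, here is a minimal sketch of Niblack binarization in plain Python. It only illustrates the thresholding rule (local mean plus k times local standard deviation); the function name, the window size and the value of k are my own assumptions for illustration and are not taken from the Photo OCR system.

```python
import math


def niblack_binarize(image, window=3, k=-0.2):
    """Binarize a grayscale image (list of rows) with Niblack's local rule.

    For each pixel the threshold is T = mean + k * std, computed over a
    window x window neighbourhood clipped at the image borders.
    Pixels strictly darker than T are marked 1 (ink), all others 0.
    The defaults for `window` and `k` are illustrative assumptions.
    """
    height, width = len(image), len(image[0])
    half = window // 2
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Collect the neighbourhood, clipped to the image bounds.
            values = [
                image[yy][xx]
                for yy in range(max(0, y - half), min(height, y + half + 1))
                for xx in range(max(0, x - half), min(width, x + half + 1))
            ]
            mean = sum(values) / len(values)
            variance = sum((v - mean) ** 2 for v in values) / len(values)
            threshold = mean + k * math.sqrt(variance)
            out[y][x] = 1 if image[y][x] < threshold else 0
    return out


# A bright 5x5 patch with one dark vertical stroke in the middle column.
patch = [[200, 200, 20, 200, 200] for _ in range(5)]
mask = niblack_binarize(patch)
```

Because the threshold is computed per neighbourhood, the dark stroke is picked out even when the overall brightness of the image changes from region to region, which is the reason local thresholding is attractive for scene text.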

Recent Advances

Text Detection and Localization: This stage processes the image to identify text and its position. Methods fall broadly into the three groups below.

Connected component (CC)-based methods: These find small components and combine them into larger ones, filter out non-text components using a classifier, and finally extract the text and merge it into a single text region. The arguable limitations of this approach are handling rotation, scale changes and complex backgrounds.

Techniques that address these limitations:

  1. Maximally Stable Extremal Regions (MSER): Robust to changes in geometry and illumination. Its disadvantage is that it adapts only to horizontal text.
  2. Stroke Width Transform (SWT): Efficient, and able to detect text in any font or language. It is insensitive to direction and scale.
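
The CC-based idea described above (find components, filter out non-text ones, merge the rest into a text region) can be sketched in a few lines of plain Python. This is only an illustration: the geometric rule in `looks_like_text` is a hypothetical stand-in for a trained classifier, and the thresholds `min_area` and `max_aspect` are assumptions of mine.

```python
from collections import deque


def connected_components(binary):
    """Label 4-connected foreground blobs in a 0/1 image (list of rows).

    Returns a list of dicts with the bounding box (x0, y0, x1, y1)
    and the pixel count of each blob, in scan order.
    """
    height, width = len(binary), len(binary[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if binary[y][x] != 1 or seen[y][x]:
                continue
            # Breadth-first flood fill starting from this unvisited pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            area = 0
            while queue:
                cx, cy = queue.popleft()
                area += 1
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and binary[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            components.append({"box": (x0, y0, x1, y1), "area": area})
    return components


def looks_like_text(component, min_area=3, max_aspect=5.0):
    """Hypothetical non-text filter: drop tiny or very elongated blobs."""
    x0, y0, x1, y1 = component["box"]
    w, h = x1 - x0 + 1, y1 - y0 + 1
    return component["area"] >= min_area and max(w, h) / min(w, h) <= max_aspect


def merge_boxes(boxes):
    """Merge bounding boxes into one enclosing text region."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


binary = [
    [1, 1, 0, 0, 0, 0, 1],
    [1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 0, 0],
]
kept = [c["box"] for c in connected_components(binary) if looks_like_text(c)]
region = merge_boxes(kept)
```

In a real system the input would come from MSER or SWT rather than a hand-written mask, and the filter would be a learned classifier; the control flow stays the same.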

Texture-based methods: The idea is that text in images has distinct textural properties that separate it from the background. The techniques below are commonly used in these methods.

  1. Gabor filters
  2. Wavelet transform
  3. Fast Fourier transform (FFT)
  4. Discrete cosine transform (DCT) domain
  5. Laplacian wavelet, wavelet decomposition
  6. Symmetry-based text line detector, which exploits the symmetry and self-similarity properties of character groups
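
As an illustration of how a texture feature from the list above responds to stroke-like patterns, here is a small Gabor filter sketch in plain Python. The kernel follows the standard Gabor formula (a Gaussian envelope multiplied by a cosine carrier); the function names, the kernel size and the parameter values are assumptions chosen only for this example.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5, psi=0.0):
    """Build a real-valued Gabor kernel as a size x size list of rows.

    theta is the orientation in radians, wavelength the period of the
    cosine carrier in pixels, sigma the width of the Gaussian envelope
    and gamma the spatial aspect ratio.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates into the filter's own frame.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def mean_abs_response(image, kernel):
    """Average absolute filter response over all fully covered positions."""
    size = len(kernel)
    height, width = len(image), len(image[0])
    total, count = 0.0, 0
    for y in range(height - size + 1):
        for x in range(width - size + 1):
            acc = 0.0
            for ky in range(size):
                for kx in range(size):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            total += abs(acc)
            count += 1
    return total / count


# Vertical stripes with a period of 4 pixels, a crude stand-in for strokes.
stripes = [[1.0 if x % 4 < 2 else 0.0 for x in range(12)] for _ in range(12)]
matched = mean_abs_response(stripes, gabor_kernel(5, 4.0, 0.0, 2.0))
crossed = mean_abs_response(stripes, gabor_kernel(5, 4.0, math.pi / 2, 2.0))
```

The kernel whose carrier varies across the stripes (`theta = 0`) responds much more strongly than the one rotated by 90 degrees. Texture-based detectors build on this kind of orientation- and frequency-selective response to separate text from background.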

Deep learning-based methods: CNNs have been widely explored and have answered many previously unresolved questions. Their main advantage is that they extract features directly from images at a lower computational cost. These properties have helped a great deal with scene text detection in natural images. CNN-based implementations fall broadly into the three groups below.

  1. Region proposal-based methods: Simple CNNs with MSER, and R-CNN and its successors. R-CNN-style detectors work with bounding boxes, drawing a box around each object in the image. Faster R-CNN handles both classification and detection of objects in two stages: a Region Proposal Network (RPN), which acts as a kind of attention mechanism, proposes regions of interest (ROIs), and ROI pooling then extracts fixed-size features from each ROI so that a class label can be assigned to it. Mask R-CNN takes an instance segmentation approach: in addition to the bounding box, it produces a shaded mask for each object. Pixel-by-pixel fully convolutional networks can also be placed in this category.
  2. Segmentation-based methods: These mainly focus on producing more precise multi-scale text detection but are ineffective at detecting individual words. An example is a cascade of two convolutional networks, TextSegNet and YOLO (WordDetNet). A character candidate extraction method based on superpixel segmentation with hierarchical clustering also belongs to this group.
  3. Hybrid methods using multitask learning: These combine CC-based and texture-based methods. Examples include:

a. Character candidate detection (a cascade boosting technique) with a min-cost flow network (false character candidate removal, text line extraction, text line verification).

b. Connectionist Text Proposal Network (CTPN), which can be extended to multilingual and multi-scale text detection (vertical anchor mechanism, in-network RNN).

c. Text-attentional convolutional neural network (Text-CNN) with contrast-enhanced MSER.
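
Bounding-box detectors such as the region proposal-based methods above typically produce several overlapping boxes for the same piece of text. A common post-processing step, which I am adding here as an illustration rather than as part of any specific method listed above, is non-maximum suppression (NMS) based on intersection over union (IoU). The function names and the 0.5 threshold are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it.

    Returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Two near-duplicate detections of one word plus a separate word.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

Here the second box overlaps the first with an IoU of about 0.68, so it is suppressed and only the first and third boxes remain.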

Text Recognition: Identifying and understanding the text in the candidate region. Methods are mainly classified into the three groups below.

Character-based methods:

1. “Strokelets”, a set of multi-scale mid-level primitives that can be learned automatically from bounding box labels. They are very good at describing characters.

2. Character recognition by extracting low-level features and integrating them automatically via a region-based feature pooling technique.

Word-based methods (recognizing text at the word level):

  1. Dense SIFT in a bag-of-keypoints framework, with which characters can be recognized robustly.
  2. Word segmentation combined with recognition in a probabilistic framework; lexical decision and sparse beam search were used to improve recognition accuracy.
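
To give a feel for how a lexicon can improve word-level accuracy, here is a minimal sketch that snaps a noisy recognizer output to the closest lexicon word by edit distance. This is a simplified stand-in I chose for illustration; it is not the probabilistic framework or the sparse beam search mentioned above, and the function names and the sample lexicon are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free on a match
            ))
        prev = cur
    return prev[-1]


def lexicon_decision(raw, lexicon):
    """Return the lexicon word closest to the raw recognizer output."""
    return min(lexicon, key=lambda word: edit_distance(raw.lower(), word.lower()))


# The recognizer confused the letter I with the digit 1.
corrected = lexicon_decision("EX1T", ["EXIT", "ENTER", "OPEN"])
```

With a small, task-specific lexicon (street signs, product names) this kind of correction fixes many single-character confusions, at the cost of never producing a word that is outside the lexicon.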

Sequence-based methods: Text is represented as a sequence of characters.

  1. RARE (Robust text recognizer with Automatic REctification), a recognizer for irregular text that combines a spatial transformer network (STN) with a sequence recognition network (SRN).
  2. R2AM (recursive recurrent neural networks with attention modeling), a lexicon-free photo OCR system.
  3. Convolutional recurrent neural network (CRNN).
  4. An auxiliary dense character detection model combined with an attention model.
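
Sequence-based recognizers such as CRNN output one score vector per image column, and these per-frame predictions have to be turned into a character string. CRNN does this with connectionist temporal classification (CTC). Below is a minimal sketch of greedy CTC decoding in plain Python; the function name, the choice of index 0 for the blank symbol and the toy alphabet are assumptions for illustration.

```python
def ctc_greedy_decode(frame_scores, alphabet, blank=0):
    """Greedy CTC decoding of per-frame class scores.

    frame_scores: list of score lists, one per time step. In this sketch
    class index `blank` is the CTC blank and class index i (i >= 1)
    stands for alphabet[i - 1].
    Takes the best class per frame, collapses consecutive repeats,
    then removes blanks.
    """
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    chars = []
    previous = None
    for index in best_path:
        if index != previous and index != blank:
            chars.append(alphabet[index - 1])
        previous = index
    return "".join(chars)


def one_hot(index, size):
    """Helper for the toy example: a score vector with a single peak."""
    return [1.0 if i == index else 0.0 for i in range(size)]


alphabet = "helo"  # classes 1..4; class 0 is the blank
path = [1, 1, 0, 2, 3, 3, 0, 3, 4]  # h h _ e l l _ l o
text = ctc_greedy_decode([one_hot(i, 5) for i in path], alphabet)
```

The blank between the two runs of `l` is what lets the decoder output a double letter: without it, the repeated frames would be collapsed into a single character.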

End-to-End Text Recognition Systems: CNNs have drastically changed the procedures and techniques for building a combined model, a one-stop system that performs both text detection and text recognition. TextBoxes was developed as part of the advances in proposed end-to-end models. Sliding-window and connected-component methods are the most commonly proposed techniques, and they include an unconstrained, real-time, end-to-end text localization and recognition method.

Benchmark Datasets for scene text identification in Images are below:

Future areas:

We have seen numerous approaches across different categories and methods, all developing rapidly; the CNN+RNN framework is especially popular. Now, let's discuss some exciting future directions.

  1. Complex scenes and large-scale datasets: You can refer to the COCO dataset and try to build cases that combine multiple challenges.
  2. Multilingual detection and recognition: Identifying multiple languages, for example script identification across varied backgrounds and text sizes, is one of the areas I aspire to work on.
  3. Real-time detection and recognition: Identifying text in images in real time is also a major area of focus.
