Overview of the new neural network system in Tesseract 4.00

Introduction

Tesseract 4.00 includes a new neural network subsystem configured as a textline recognizer. It has its origins in OCRopus’ Python-based LSTM implementation, but has been totally redesigned for Tesseract in C++. The neural network system in Tesseract pre-dates TensorFlow, but is compatible with it, as there is a network description language called Variable Graph Specification Language (VGSL), that is also available for TensorFlow. See https://github.com/tensorflow/models/tree/master/research/street

The idea of VGSL is that it is possible to build a neural network and train it without having to learn a lot of anything. There is no need to learn Python, TensorFlow, or even write any C++ code. It is merely required to understand the VGSL specification language well enough to build syntactically correct network descriptions. Some basic knowledge of what the various neural network layer types are and how they are combined will go a very long way.

Integration with Tesseract

The Tesseract 4.00 neural network subsystem is integrated into Tesseract as a line recognizer. It can be used with the existing layout analysis to recognize text within a large document, or it can be used in conjunction with an external text detector to recognize text from an image of a single textline.

The neural network engine is the default for 4.00. To recognize text from an image of a single text line, use SetPageSegMode(PSM_RAW_LINE). This can be used from the command-line with -psm 13

The neural network engine has been integrated to enable the multi- language mode that worked with Tesseract 3.04, but this will be improved in a future release. Vertical text is now supported for Chinese, Japanese and Korean, and should be detected automatically.

Hardware and CPU Requirements

The Tesseract 4.00 neural network subsystem is heavily compute-intensive, using the order of ten times the CPU resources of the base Tesseract, but the impact is mitigated, if your platform supports it, as follows:

On a machine with multiple cores, and AVX, an easy English image may take twice as much real time, and use 7 times the CPU as base Tesseract, whereas Hindi takes more CPU than base Tesseract, but actually runs faster in terms of real time.

If the above components are missing, there is a slower plain C++ implementation that enables the code to still work.

Little-endian and big-endian systems are both supported.

For Open Source Contributors

The initial implementation lacks the following:

Basics of the Implementation

All network layer types are derived from the Network base-class. The Plumbing sub-class is a base-class for layers that manipulate other layers in some way, e.g. by reshaping their input/output or organising a group of layers.

The input/output data “Tensor” is NetworkIO and the weights are stored in a WeightMatrix, both of which contain a Tesseract GENERIC_2D_ARRAY to hold the data. LSTMRecognizer provides the higher-level abstraction of converting an image of a textline to a sequence of tesseract WERD_RES classes. LSTMTrainer likewise handles the abstraction of training a network on an image of a textline that has a UTF-8 string ‘truth’. NetworkBuilder takes responsibility for converting the VGSL specification language to a graph of network elements.

Adding a new Layer Type

A new layer class must derive from Network or Plumbing and implement at least the following virtual methods:

Layers that have weights must also implement Update to update the weights with a set of gradients. There are quite a few other methods that may need to be implemented, depending on the specific requirements of the new layer. See network.h for more information on the methods that may require implementing.