VGSL Specs - rapid prototyping of mixed conv/LSTM networks for images
Variable-size Graph Specification Language (VGSL) enables the specification of a neural network, composed of convolutions and LSTMs, that can process variable-sized images, from a very short definition string.
Applications: What is VGSL Specs good for?
VGSL Specs are designed specifically to create networks for:
- Variable size images as the input. (In one or BOTH dimensions!)
- Output an image (heat map), sequence (like text), or a category.
- Convolutions and LSTMs are the main computing component.
- Fixed-size images are OK too!
Model string input and output
A neural network model is described by a string that describes the input spec, the output spec and the layers spec in between. Example:
[1,0,0,3 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]
The first 4 numbers specify the size and type of the input, and follow the TensorFlow convention for an image tensor: [batch, height, width, depth]. Batch is currently ignored, but eventually may be used to indicate a training mini-batch size. Height and/or width may be zero, allowing them to be variable. A non-zero value for height and/or width means that all input images are expected to be of that size, and will be bent to fit if needed. Depth needs to be 1 for greyscale and 3 for color. As a special case, a different value of depth, and a height of 1 causes the image to be treated from input as a sequence of vertical pixel strips. NOTE THAT THROUGHOUT, x and y are REVERSED from conventional mathematics, to use the same convention as TensorFlow. The reason TF adopts this convention is to eliminate the need to transpose images on input, since adjacent memory locations in images increase x and then y, while adjacent memory locations in tensors in TF, and NetworkIO in tesseract increase the rightmost index first, then the next-left and so-on, like C arrays.
The last “word” is the output specification and takes the form:
O(2|1|0)(l|s|c)n output layer with n classes. 2 (heatmap) Output is a 2-d vector map of the input (possibly at different scale). (Not yet supported.) 1 (sequence) Output is a 1-d sequence of vector values. 0 (category) Output is a 0-d single vector value. l uses a logistic non-linearity on the output, allowing multiple hot elements in any output vector value. (Not yet supported.) s uses a softmax non-linearity, with one-hot output in each value. c uses a softmax with CTC. Can only be used with s (sequence). NOTE Only O1s and O1c are currently supported.
The number of classes is ignored (only there for compatibility with TensorFlow) as the actual number is taken from the unicharset.
Syntax of the Layers in between
NOTE that all ops input and output the standard TF convention of a 4-d tensor:
[batch, height, width, depth] regardless of any collapsing of dimensions.
This greatly simplifies things, and allows the VGSLSpecs class to track changes
to the values of widths and heights, so they can be correctly passed in to LSTM
operations, and used by any downstream CTC operation.
NOTE: in the descriptions below,
<d> is a numeric value, and literals are
described using regular expression syntax.
NOTE: Whitespace is allowed between ops.
C(s|t|r|l|m)<y>,<x>,<d> Convolves using a y,x window, with no shrinkage, random infill, d outputs, with s|t|r|l|m non-linear layer. F(s|t|r|l|m)<d> Fully-connected with s|t|r|l|m non-linearity and d outputs. Reduces height, width to 1. Connects to every y,x,depth position of the input, reducing height, width to 1, producing a single <d> vector as the output. Input height and width *must* be constant. For a sliding-window linear or non-linear map that connects just to the input depth, and leaves the input image size as-is, use a 1x1 convolution eg. Cr1,1,64 instead of Fr64. L(f|r|b)(x|y)[s]<n> LSTM cell with n outputs. The LSTM must have one of: f runs the LSTM forward only. r runs the LSTM reversed only. b runs the LSTM bidirectionally. It will operate on either the x- or y-dimension, treating the other dimension independently (as if part of the batch). s (optional) summarizes the output in the requested dimension, outputting only the final step, collapsing the dimension to a single element. LS<n> Forward-only LSTM cell in the x-direction, with built-in Softmax. LE<n> Forward-only LSTM cell in the x-direction, with built-in softmax, with binary Encoding.
In the above,
(s|t|r|l|m) specifies the type of the non-linearity:
s = sigmoid t = tanh r = relu l = linear (i.e., No non-linearity) m = softmax
Cr5,5,32 Runs a 5x5 Relu convolution with 32 depth/number of filters.
Lfx128 runs a forward-only LSTM, in the x-dimension with 128 outputs, treating
the y dimension independently.
Lfys64 runs a forward-only LSTM in the y-dimension with 64 outputs, treating
the x-dimension independently and collapses the y-dimension to 1 element.
The plumbing ops allow the construction of arbitrarily complex graphs. Something currently missing is the ability to define macros for generating say an inception unit in multiple places.
[...] Execute ... networks in series (layers). (...) Execute ... networks in parallel, with their output concatenated in depth. S<y>,<x> Rescale 2-D input by shrink factor y,x, rearranging the data by increasing the depth of the input by factor xy. **NOTE** that the TF implementation of VGSLSpecs has a different S that is not yet implemented in Tesseract. Mp<y>,<x> Maxpool the input, reducing each (y,x) rectangle to a single value.
Full Example: A 1-D LSTM capable of high quality OCR
[1,1,0,48 Lbx256 O1c105]
As layer descriptions: (Input layer is at the bottom, output at the top.)
O1c105: Output layer produces 1-d (sequence) output, trained with CTC, outputting 105 classes. Lbx256: Bi-directional LSTM in x with 256 outputs 1,1,0,48: Input is a batch of 1 image of height 48 pixels in greyscale, treated as a 1-dimensional sequence of vertical pixel strips. : The network is always expressed as a series of layers.
This network works well for OCR, as long as the input image is carefully normalized in the vertical direction, with the baseline and meanline in constant places.
Full Example: A multi-layer LSTM capable of high quality OCR
[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]
As layer descriptions: (Input layer is at the bottom, output at the top.)
O1c105: Output layer produces 1-d (sequence) output, trained with CTC, outputting 105 classes. Lfx256: Forward-only LSTM in x with 256 outputs Lrx128: Reverse-only LSTM in x with 128 outputs Lfx128: Forward-only LSTM in x with 128 outputs Lfys64: Dimension-summarizing LSTM, summarizing the y-dimension with 64 outputs Mp3,3: 3 x 3 Maxpool Ct5,5,16: 5 x 5 Convolution with 16 outputs and tanh non-linearity 1,0,0,1: Input is a batch of 1 image of variable size in greyscale : The network is always expressed as a series of layers.
The summarizing LSTM makes this network more resilient to vertical variation in position of the text.
Variable size inputs and summarizing LSTM
NOTE that currently the only way of collapsing a dimension of unknown size to known size (1) is through the use of a summarizing LSTM. A single summarizing LSTM will collapse one dimension (x or y), leaving a 1-d sequence. The 1-d sequence can then be collapsed in the other dimension to make a 0-d categorical (softmax) or embedding (logistic) output.
For OCR purposes then, the height of the input images must either be fixed, and scaled (using Mp or S) vertically to 1 by the top layer, or to allow variable-height images, a summarizing LSTM must be used to collapse the vertical dimension to a single value. The summarizing LSTM can also be used with a fixed height input.