VGSL Specs - rapid prototyping of mixed conv/LSTM networks for images

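Variable-size Graph Specification Language (VGSL) lets you specify a neural network, composed of convolutions and LSTMs, that can process variable-sized images, from a very short definition string.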

Applications: What is VGSL Specs good for?

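VGSL Specs are specifically designed to create networks for:

Variable-size images as the input. (In one or both dimensions!)
Output an image (heatmap), a sequence (like text), or a category.
Convolutions and LSTMs are the main computing components.
Fixed-size images are OK too!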

Model string input and output

A neural network model is described by a string that gives the input spec, the output spec, and the layer specs in between. Example:

[1,0,0,3 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]

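The first 4 numbers specify the size and type of the input, and follow the TensorFlow convention for an image tensor: [batch, height, width, depth]. Batch is currently ignored, but may eventually be used to indicate a training mini-batch size. Height and/or width may be zero, allowing them to be variable. A non-zero value for height and/or width means that all input images are expected to be of that size, and will be warped to fit if needed. Depth needs to be 1 for greyscale and 3 for color. As a special case, a different value of depth combined with a height of 1 causes the input image to be treated as a sequence of vertical pixel strips. Note that throughout, x and y are reversed from conventional mathematics, to use the same convention as TensorFlow. TF adopts this convention to eliminate the need to transpose images on input: adjacent memory locations in images increase x and then y, while adjacent memory locations in tensors in TF, and in NetworkIO in tesseract, increase the rightmost index first, then the next index to the left, and so on, like C arrays.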
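The input convention above can be illustrated with a small sketch. The names `InputSpec` and `parse_input_spec` are hypothetical and are not part of Tesseract or TensorFlow; the sketch only assumes that the input spec is the first whitespace-separated word of the model string.

```python
from typing import NamedTuple, Optional


class InputSpec(NamedTuple):
    batch: int
    height: Optional[int]  # None means variable height
    width: Optional[int]   # None means variable width
    depth: int


def parse_input_spec(model: str) -> InputSpec:
    """Read the leading 'batch,height,width,depth' word of a model string."""
    first_word = model.strip().lstrip("[").split()[0]
    values = [int(v) for v in first_word.split(",")]
    if len(values) != 4:
        raise ValueError(f"expected 4 numbers, got {first_word!r}")
    batch, height, width, depth = values
    # A zero height or width marks that dimension as variable.
    return InputSpec(batch, height or None, width or None, depth)


print(parse_input_spec("[1,0,0,3 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]"))
```

Representing a variable dimension as `None` rather than 0 is a choice made for this sketch, so that later shape arithmetic cannot silently treat "variable" as a size of zero.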

The last "word" is the output specification and takes the form:

O(2|1|0)(l|s|c)n output layer with n classes.
  2 (heatmap) Output is a 2-d vector map of the input (possibly at
    different scale). (Not yet supported.)
  1 (sequence) Output is a 1-d sequence of vector values.
  0 (category) Output is a 0-d single vector value.
  l uses a logistic non-linearity on the output, allowing multiple
    hot elements in any output vector value. (Not yet supported.)
  s uses a softmax non-linearity, with one-hot output in each value.
  c uses a softmax with CTC. Can only be used with 1 (sequence).
  NOTE Only O1s and O1c are currently supported.

The number of classes is ignored (it is only there for compatibility with TensorFlow), as the actual number is taken from the unicharset.
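As a rough illustration of the output grammar, the following sketch decodes an output word with a regular expression. The function name `parse_output_spec` and the returned dictionary layout are assumptions made for this example, not an existing API.

```python
import re

OUTPUT_RE = re.compile(r"O([012])([lsc])(\d+)")
SHAPES = {"2": "heatmap", "1": "sequence", "0": "category"}
KINDS = {"l": "logistic", "s": "softmax", "c": "softmax with CTC"}


def parse_output_spec(word: str) -> dict:
    """Decode an output word such as 'O1c105' (hypothetical helper)."""
    match = OUTPUT_RE.fullmatch(word)
    if match is None:
        raise ValueError(f"not an output spec: {word!r}")
    dims, kind, classes = match.groups()
    return {
        "shape": SHAPES[dims],
        "nonlinearity": KINDS[kind],
        # Parsed for completeness; the text says the real count comes
        # from the unicharset.
        "classes": int(classes),
        # The spec notes that only O1s and O1c are currently supported.
        "supported": (dims, kind) in {("1", "s"), ("1", "c")},
    }


print(parse_output_spec("O1c105"))
```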

Syntax of the layers in between

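Note that all ops take as input, and produce as output, a 4-d tensor in the standard TF convention: [batch, height, width, depth], regardless of any collapsing of dimensions. This greatly simplifies things, and allows the VGSLSpecs class to track changes to the width and height values, so they can be passed correctly to LSTM operations and used by any downstream CTC operation.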

NOTE: in the descriptions below, <d> is a numeric value, and literals are described using regular-expression syntax.

NOTE: whitespace is allowed between ops.

Functional ops

C(s|t|r|l|m)<y>,<x>,<d> Convolves using a y,x window, with no shrinkage,
  random infill, d outputs, with s|t|r|l|m non-linear layer.
F(s|t|r|l|m)<d> Fully-connected with s|t|r|l|m non-linearity and d outputs.
  Connects to every y,x,depth position of the input, reducing height, width
  to 1, producing a single <d> vector as the output.
  Input height and width *must* be constant.
  For a sliding-window linear or non-linear map that connects just to the
  input depth, and leaves the input image size as-is, use a 1x1 convolution
  e.g. Cr1,1,64 instead of Fr64.
L(f|r|b)(x|y)[s]<n> LSTM cell with n outputs.
  The LSTM must have one of:
    f runs the LSTM forward only.
    r runs the LSTM reversed only.
    b runs the LSTM bidirectionally.
  It will operate on either the x- or y-dimension, treating the other dimension
  independently (as if part of the batch).
  s (optional) summarizes the output in the requested dimension, outputting
    only the final step, collapsing the dimension to a single element.
LS<n> Forward-only LSTM cell in the x-direction, with built-in Softmax.
LE<n> Forward-only LSTM cell in the x-direction, with built-in softmax,
  with binary Encoding.

In the above, (s|t|r|l|m) specifies the type of the non-linearity:

s = sigmoid
t = tanh
r = relu
l = linear (i.e., No non-linearity)
m = softmax

Examples

Cr5,5,32 runs a 5x5 Relu convolution with depth/number of filters 32.

Lfx128 runs a forward-only LSTM in the x-dimension with 128 outputs, treating the y-dimension independently.

Lfys64 runs a forward-only LSTM in the y-dimension with 64 outputs, treating the x-dimension independently, and collapses the y-dimension to 1 element.
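The three examples above can be checked against the grammar with a short sketch. `describe_op` is a hypothetical helper written for this illustration; it covers only the C, F and L(f|r|b)(x|y)[s] forms and deliberately rejects LS and LE.

```python
import re

NONLINEARITIES = {"s": "sigmoid", "t": "tanh", "r": "relu",
                  "l": "linear", "m": "softmax"}
DIRECTIONS = {"f": "forward", "r": "reverse", "b": "bidirectional"}

CONV_RE = re.compile(r"C([strlm])(\d+),(\d+),(\d+)")
FC_RE = re.compile(r"F([strlm])(\d+)")
LSTM_RE = re.compile(r"L([frb])([xy])(s?)(\d+)")


def describe_op(word: str) -> dict:
    """Decode one functional op word (sketch, not the Tesseract parser)."""
    match = CONV_RE.fullmatch(word)
    if match:
        nonlin, y, x, d = match.groups()
        return {"op": "conv", "nonlinearity": NONLINEARITIES[nonlin],
                "window": (int(y), int(x)), "outputs": int(d)}
    match = FC_RE.fullmatch(word)
    if match:
        nonlin, d = match.groups()
        return {"op": "fully-connected",
                "nonlinearity": NONLINEARITIES[nonlin], "outputs": int(d)}
    match = LSTM_RE.fullmatch(word)
    if match:
        direction, dim, summarize, n = match.groups()
        return {"op": "lstm", "direction": DIRECTIONS[direction],
                "dimension": dim, "summarize": summarize == "s",
                "outputs": int(n)}
    raise ValueError(f"op not covered by this sketch: {word!r}")


for word in ("Cr5,5,32", "Lfx128", "Lfys64"):
    print(word, describe_op(word))
```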

Plumbing ops

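The plumbing ops allow the construction of arbitrarily complex graphs. One feature currently missing is the ability to define macros for generating, say, an inception unit in multiple places.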

[...] Execute ... networks in series (layers).
(...) Execute ... networks in parallel, with their output concatenated in depth.
S<y>,<x> Rescale 2-D input by shrink factor y,x, rearranging the data by
  increasing the depth of the input by factor xy.
  **NOTE** that the TF implementation of VGSLSpecs has a different S that is
  not yet implemented in Tesseract.
Mp<y>,<x> Maxpool the input, reducing each (y,x) rectangle to a single value.
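The effect of S and Mp on the [batch, height, width, depth] shape can be sketched as simple arithmetic. The helper names are hypothetical, and the sketch assumes fixed dimensions are exact multiples of the factors, because the text above does not say how remainders are handled. Variable dimensions are written as None and stay variable.

```python
from typing import Optional, Tuple

Shape = Tuple[int, Optional[int], Optional[int], int]


def _shrink(size: Optional[int], factor: int) -> Optional[int]:
    if size is None:  # variable dimension stays variable
        return None
    if size % factor:
        # Remainder handling is not specified here, so the sketch refuses.
        raise ValueError(f"{size} is not a multiple of {factor}")
    return size // factor


def rescale_shape(shape: Shape, y: int, x: int) -> Shape:
    """S<y>,<x>: shrink height and width, grow depth by a factor of x*y."""
    batch, height, width, depth = shape
    return (batch, _shrink(height, y), _shrink(width, x), depth * y * x)


def maxpool_shape(shape: Shape, y: int, x: int) -> Shape:
    """Mp<y>,<x>: each (y,x) rectangle becomes one value; depth is unchanged."""
    batch, height, width, depth = shape
    return (batch, _shrink(height, y), _shrink(width, x), depth)


print(rescale_shape((1, 48, 96, 1), 2, 2))
print(maxpool_shape((1, 48, 96, 16), 3, 3))
```

The total number of values is preserved by S (it only rearranges data), while Mp discards all but the maximum in each rectangle.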

Full Example: A 1-D LSTM capable of high quality OCR

[1,1,0,48 Lbx256 O1c105]

As layer descriptions: (Input layer is at the bottom, output layer at the top.)

O1c105: Output layer produces 1-d (sequence) output, trained with CTC,
  outputting 105 classes.
Lbx256: Bi-directional LSTM in x with 256 outputs
1,1,0,48: Input is a batch of 1 image of height 48 pixels in greyscale, treated
  as a 1-dimensional sequence of vertical pixel strips.
[]: The network is always expressed as a series of layers.

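This network works well for OCR, as long as the input image is carefully normalized in the vertical direction, with the baseline and meanline in constant places.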
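To make the structure of this string concrete, here is a minimal sketch that splits a flat model string into its input word, layer words and output word. `split_model_string` is a hypothetical name, and the sketch assumes a single outer [...] series with whitespace between every word and no nested (...) or [...] groups.

```python
from typing import List, Tuple


def split_model_string(model: str) -> Tuple[str, List[str], str]:
    """Split '[input layer ... output]' into (input, layers, output)."""
    body = model.strip()
    if not (body.startswith("[") and body.endswith("]")):
        raise ValueError("the network must be written as a [...] series")
    words = body[1:-1].split()
    if len(words) < 2:
        raise ValueError("need at least an input spec and an output spec")
    return words[0], words[1:-1], words[-1]


input_word, layers, output_word = split_model_string("[1,1,0,48 Lbx256 O1c105]")
print(input_word, layers, output_word)
```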

Full Example: A multi-layer LSTM capable of high quality OCR

[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]

As layer descriptions: (Input layer is at the bottom, output layer at the top.)

O1c105: Output layer produces 1-d (sequence) output, trained with CTC,
  outputting 105 classes.
Lfx256: Forward-only LSTM in x with 256 outputs
Lrx128: Reverse-only LSTM in x with 128 outputs
Lfx128: Forward-only LSTM in x with 128 outputs
Lfys64: Dimension-summarizing LSTM, summarizing the y-dimension with 64 outputs
Mp3,3: 3 x 3 Maxpool
Ct5,5,16: 5 x 5 Convolution with 16 outputs and tanh non-linearity
1,0,0,1: Input is a batch of 1 image of variable size in greyscale
[]: The network is always expressed as a series of layers.

The summarizing LSTM makes this network more resilient to vertical variation in the position of the text.
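The way the tensor shape evolves through this network can be traced with a small sketch. `trace_shapes` is a hypothetical helper that handles only the op types used in this example (C, Mp, and forward or reverse L). Variable dimensions are written as None, and floor division for pooled fixed sizes is an assumption. Bidirectional LSTMs are left out because the text does not state how their output depth is formed.

```python
import re


def _pooled(size, factor):
    # Variable sizes stay variable; floor division is an assumption.
    return None if size is None else size // factor


def trace_shapes(model):
    """Return [(word, (batch, height, width, depth)), ...] for a flat model."""
    words = model.strip()[1:-1].split()
    batch, height, width, depth = (int(v) for v in words[0].split(","))
    shape = [batch, height or None, width or None, depth]
    trace = [("input", tuple(shape))]
    for word in words[1:-1]:  # the last word is the output spec
        conv = re.fullmatch(r"C[strlm]\d+,\d+,(\d+)", word)
        pool = re.fullmatch(r"Mp(\d+),(\d+)", word)
        lstm = re.fullmatch(r"L[fr]([xy])(s?)(\d+)", word)
        if conv:
            shape[3] = int(conv.group(1))  # no shrinkage: only depth changes
        elif pool:
            shape[1] = _pooled(shape[1], int(pool.group(1)))
            shape[2] = _pooled(shape[2], int(pool.group(2)))
        elif lstm:
            axis = 1 if lstm.group(1) == "y" else 2
            if lstm.group(2):  # summarizing: only the final step is kept
                shape[axis] = 1
            shape[3] = int(lstm.group(3))
        else:
            raise ValueError(f"op not covered by this sketch: {word!r}")
        trace.append((word, tuple(shape)))
    return trace


model = "[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]"
for word, shape in trace_shapes(model):
    print(f"{word:10s} {shape}")
```

In the trace, the height becomes 1 at Lfys64 and stays there, which is what lets the later x-direction LSTMs and the CTC output see a plain 1-d sequence.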

Variable size inputs and summarizing LSTM

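Note that currently the only way to collapse a dimension of unknown size to a known size (1) is to use a summarizing LSTM. A single summarizing LSTM collapses one dimension (x or y), leaving a 1-d sequence. The 1-d sequence can then be collapsed in the other dimension to make a 0-d categorical (softmax) or embedding (logistic) output.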

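For OCR purposes, therefore, there are two options. Either the height of the input images is fixed and is scaled vertically to 1 (using Mp or S) by the top layer, or, to allow variable-height images, a summarizing LSTM is used to collapse the vertical dimension to a single value. The summarizing LSTM can also be used with a fixed-height input.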
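The rule above can be expressed as a quick check on a flat model string. `height_collapses_to_one` is a hypothetical helper; it assumes whitespace-separated words, no nested groups, and floor division when Mp or S shrinks a fixed height.

```python
import re


def height_collapses_to_one(model: str) -> bool:
    """True if the vertical dimension is 1 by the time the output is reached."""
    words = model.strip()[1:-1].split()
    height = int(words[0].split(",")[1])  # 0 means variable height
    for word in words[1:-1]:
        if re.fullmatch(r"L[frb]ys\d+", word):
            return True  # a summarizing LSTM in y collapses any height
        shrink = re.fullmatch(r"(?:Mp|S)(\d+),\d+", word)
        if shrink and height:
            height //= int(shrink.group(1))  # assumption: floor division
    # A variable height (0) that was never summarized is not collapsed.
    return height == 1


print(height_collapses_to_one("[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]"))
print(height_collapses_to_one("[1,0,0,1 Ct5,5,16 Lfx128 O1c105]"))
```

The second string is a made-up counterexample: its height is variable and nothing summarizes the y-dimension, so the check reports that the vertical dimension is not collapsed.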