Installing Tesseract from Git

Installing with Autoconf Tools

To do this, you must have automake, libtool, leptonica, make and pkg-config installed. In addition, you need Git and a C++ compiler.

On Debian or Ubuntu, you can probably install all required packages like this:

apt-get install automake ca-certificates g++ git libtool libleptonica-dev make pkg-config
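Before continuing, it can help to confirm that the build tools are actually on the PATH. The following is a minimal sketch, not part of the Tesseract build system. The helper name `missing_tools` is made up for illustration, and `libtoolize` is checked on the assumption that the libtool package provides that command.

```shell
# Print every command from the argument list that is not found on PATH.
# (missing_tools is a hypothetical helper, not a Tesseract script.)
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Commands expected from the packages listed above. libleptonica-dev
# installs headers rather than a command, so it is not checked here.
missing=$(missing_tools automake libtoolize make pkg-config git g++)
if [ -n "$missing" ]; then
  echo "Missing build tools:" $missing
else
  echo "All build tools found."
fi
```

Any name printed after "Missing build tools:" points to a package that still needs to be installed.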

The optional manpages are built with asciidoc:

apt-get install --no-install-recommends asciidoc docbook-xsl xsltproc

If you also want to build the Tesseract training tools, you will also need Pango:

apt-get install libpango1.0-dev

Afterwards, to clone the master branch to your computer, do this:

git clone https://github.com/tesseract-ocr/tesseract.git

Or make a shallow clone with the commit history truncated to the latest commit only:

git clone --depth 1 https://github.com/tesseract-ocr/tesseract.git

Or clone a different branch or version:

git clone https://github.com/tesseract-ocr/tesseract.git --branch <branchName> --single-branch

Note: You may run into problems when building the latest version from GitHub. If so, download one of the latest released versions from https://github.com/tesseract-ocr/tesseract/releases instead.

Note: Tesseract requires Leptonica v1.74 or newer. If your system has only an older version of Leptonica, you must compile it manually from the source available at DanBloomberg/leptonica.
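The version requirement can be checked before running configure. This is a sketch under two assumptions: that Leptonica registers itself with pkg-config under the module name `lept`, and that `sort -V` (version sort, as in GNU coreutils) is available. The function names are illustrative only.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort), assumed to be available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Report whether the Leptonica found by pkg-config is new enough.
# The module name "lept" is an assumption about the installed .pc file.
check_leptonica() {
  lept_version=$(pkg-config --modversion lept 2>/dev/null) || {
    echo "Leptonica not found by pkg-config" >&2
    return 1
  }
  if version_ge "$lept_version" 1.74; then
    echo "Leptonica $lept_version is new enough"
  else
    echo "Leptonica $lept_version is too old; build it from source" >&2
    return 1
  fi
}
```

Calling `check_leptonica` in the shell where you plan to build reports the detected version or explains what is missing.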

Finally, run these commands:

    cd tesseract
    ./autogen.sh
    ./configure
    make
    sudo make install
    sudo ldconfig

Important: See the "Post-Install Instructions" section below.
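Once the build and install steps have finished, a quick check shows whether the new executable is reachable. The sketch below assumes nothing beyond the `tesseract --version` option; the function name `verify_tesseract` is hypothetical.

```shell
# Print the installed version, or explain why nothing was found.
# (verify_tesseract is an illustrative helper name.)
verify_tesseract() {
  if command -v tesseract >/dev/null 2>&1; then
    tesseract --version
  else
    echo "tesseract not found on PATH; check the install prefix" >&2
    return 1
  fi
}
```

Run `verify_tesseract` in a fresh terminal so that any PATH changes from the installation are picked up.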

If you get this error:

make  all-recursive
Making all in ccstruct
/bin/sh ../libtool --tag=CXX   --mode=compile g++ -DHAVE_CONFIG_H -I. -
I..  -I../ccutil -I../cutil -I../image -I../viewer -I/opt/local/
include -I/usr/local/include/leptonica  -g -O2 -MT blobbox.lo -MD -MP -
MF .deps/blobbox.Tpo -c -o blobbox.lo blobbox.cpp
mv -f .deps/blobbox.Tpo .deps/blobbox.Plo
mv: rename .deps/blobbox.Tpo to .deps/blobbox.Plo: No such file or
directory
make[3]: *** [blobbox.lo] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all] Error 2

try running autoreconf -i after running ./autogen.sh.

Building with Training Tools

The steps above do not build the Tesseract training tools. If you plan to install the training tools, you also need the following libraries:

sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev

To build Tesseract with the training tools, run the following commands:

    cd tesseract
    ./autogen.sh
    ./configure
    make
    sudo make install
    sudo ldconfig
    make training
    sudo make training-install

You can pass extra options to configure if needed, for example:

./configure --disable-openmp --disable-debug --disable-opencl --disable-graphics --disable-shared 'CXXFLAGS=-g -O2 -Wall -Wextra -Wpedantic'

Post-Install Instructions

A Tesseract installation has two parts: the engine itself and the trained data for the languages.

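The installation commands above install the Tesseract engine and the training tools. They also install the config files, such as those for output formats (pdf, tsv, hocr, alto) and those for creating box files (lstmbox, wordstrbox). In addition, language-specific trained data is required to recognize text in images.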

Three types of traineddata files (tessdata, tessdata_best and tessdata_fast) are available for over 130 languages and over 35 scripts in the tesseract-ocr GitHub repositories.

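When building from source on Linux, the tessdata config files are installed in /usr/local/share/tessdata unless you used ./configure --prefix=/usr. Once Tesseract is installed, do not forget to download the language traineddata files you need and place them in this tessdata directory (/usr/local/share/tessdata).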

If you want support for both the legacy engine (--oem 0) and the LSTM engine (--oem 1), download the traineddata files from tessdata.

If you only want support for the LSTM engine (--oem 1), use the traineddata files from tessdata_best or tessdata_fast.

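Be sure to use the download link, or fetch the raw file with wget. For example, the tessdata repository offers a direct download link for eng.traineddata, which supports both the legacy and LSTM engines of tesseract.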
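Fetching the raw file with wget can be sketched as follows. The URL pattern, in particular the `raw/main` path segment, is an assumption about how the tesseract-ocr repositories are laid out on GitHub, and both function names are made up for illustration.

```shell
# Build the raw-file URL for one language model.
# $1: repository (tessdata, tessdata_best or tessdata_fast), $2: language code.
# The "raw/main" path segment is an assumption about the repository layout.
traineddata_url() {
  printf 'https://github.com/tesseract-ocr/%s/raw/main/%s.traineddata\n' "$1" "$2"
}

# Download one model into the tessdata directory
# ($3, default /usr/local/share/tessdata).
fetch_traineddata() {
  dest=${3:-/usr/local/share/tessdata}
  wget -O "$dest/$2.traineddata" "$(traineddata_url "$1" "$2")"
}

# Show the URL that would be fetched for English from tessdata.
traineddata_url tessdata eng
```

For example, `fetch_traineddata tessdata eng "$HOME/tessdata"` would store the file in a directory you own. Writing to the default location normally requires root privileges.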

You are now ready to use tesseract!

A python3 script for downloading traineddata files is available at https://github.com/zdenop/tessdata_downloader.

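If you want to keep the traineddata files in a directory other than the one defined during installation (/usr/local/share/tessdata), you need to set an environment variable named TESSDATA_PREFIX that points to the Tesseract tessdata directory.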

For example, on Ubuntu Linux, add the following line to the bottom of your ~/.bashrc file, adjusting the path to match your setup:
```
export TESSDATA_PREFIX="/home/$USER/Downloads/tesseract/tesseract-4.1.0/tessdata"
```
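Then close and re-open your terminal for the change to take effect, or run . ~/.bashrc or source ~/.bashrc (the same thing) to apply it immediately in the current terminal.
Also place any language trained data you need in this tessdata folder. For example, the English trained data is named eng.traineddata. Download it from the tessdata repository and move it to the tessdata directory you specified in the TESSDATA_PREFIX variable above.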
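To confirm that a model file sits where Tesseract will look for it, a check along these lines may help. It is a sketch that assumes the lookup rule described above (TESSDATA_PREFIX when set, otherwise /usr/local/share/tessdata); the helper names are hypothetical.

```shell
# Directory Tesseract is expected to search: TESSDATA_PREFIX when set,
# otherwise the default install location named above.
tessdata_dir() {
  printf '%s\n' "${TESSDATA_PREFIX:-/usr/local/share/tessdata}"
}

# Succeeds when <lang>.traineddata exists in that directory.
has_traineddata() {
  [ -f "$(tessdata_dir)/$1.traineddata" ]
}

if has_traineddata eng; then
  echo "eng.traineddata found in $(tessdata_dir)"
else
  echo "eng.traineddata missing from $(tessdata_dir)"
fi
```

Running `tesseract --list-langs` afterwards shows which languages the engine itself can see.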

Building with TensorFlow

Building with TensorFlow requires additional packages for Protocol Buffers and TensorFlow. On Debian or Ubuntu, you can probably install them like this:

apt-get install libprotoc-dev libtensorflow-dev

If the necessary development files are found, all builds automatically build Tesseract and the training tools with TensorFlow. This can be overridden:

# Enforce build with TensorFlow (will fail if requirements are not met).
./configure --with-tensorflow [...]

# Don't build with TensorFlow.
./configure --without-tensorflow [...]

Build support for TensorFlow is a new feature in Git master. The resulting code has not been tested yet.

Unit Test Builds

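Such builds can be used to run the automated regression tests, which have additional requirements. These include the extra dependencies for the training tools (as described above), all git submodules, and the model repositories (*.traineddata):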

# Clone the Tesseract source tree:
git clone https://github.com/tesseract-ocr/tesseract.git
# Clone repositories with model files (from the same directory):
git clone https://github.com/tesseract-ocr/tessdata.git
git clone https://github.com/tesseract-ocr/tessdata_best.git
git clone https://github.com/tesseract-ocr/tessdata_fast.git
git clone https://github.com/tesseract-ocr/langdata_lstm.git
# Change to the Tesseract source tree and get all submodules:
cd tesseract
git submodule update --init
# Build the training tools (see above). Here we use a release built with sanitizers:
./autogen.sh
mkdir -p bin/unittest
cd bin/unittest
../../configure --disable-shared 'CXXFLAGS=-g -O2 -Wall -Wextra -Wpedantic -fsanitize=address,undefined -fstack-protector-strong -ftrapv'
make training
# Run the unit tests:
make check
cd ../..

This creates log files for all unit tests under bin/unittest/unittest, both individual and cumulative. The tests can also be run standalone, for example:

bin/unittest/unittest/stringrenderer_test

Failed tests show up prominently as segfaults or SIGILL handlers (depending on the platform).
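Running the test executables one by one can be scripted. The sketch below assumes that every test binary has a name ending in `_test`, as `stringrenderer_test` does; the function name and the `.rerun.log` suffix are made up so that logs written by `make check` are left alone.

```shell
# Run every executable whose name ends in _test inside directory $1 and
# print one PASS/FAIL line per test. Returns non-zero if any test failed.
# Output of each test goes to <test>.rerun.log next to the binary.
run_unit_tests() {
  failed=0
  for t in "$1"/*_test; do
    [ -x "$t" ] || continue
    if "$t" >"$t.rerun.log" 2>&1; then
      echo "PASS $(basename "$t")"
    else
      echo "FAIL $(basename "$t")"
      failed=1
    fi
  done
  return "$failed"
}
```

From the top of the source tree, `run_unit_tests bin/unittest/unittest` would give a compact summary. A test that dies from a signal is reported as FAIL because its exit status is non-zero.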

Debug Builds

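Such builds produce Tesseract binaries that run very slowly. They are not suitable for production, but they are well suited to finding or analyzing software problems. This is a proven build sequence: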

cd tesseract
./autogen.sh
mkdir -p bin/debug
cd bin/debug
../../configure --enable-debug --disable-shared 'CXXFLAGS=-g -O0 -Wall -Wextra -Wpedantic -fsanitize=address,undefined -fstack-protector-strong -ftrapv'
# Build tesseract and training tools. Run `make` if you don't need the training tools.
make training
cd ../..

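This activates debug code, does not use a shared Tesseract library (which makes it possible to run tesseract without installing it), disables compiler optimizations (which allows better debugging with gdb), enables many compiler warnings, and enables several run-time checks.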

Profiling Builds

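Such builds can be used to investigate performance problems. Tesseract runs slower than without profiling, but at an acceptable speed. This is a proven build sequence: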

cd tesseract
./autogen.sh
mkdir -p bin/profiling
cd bin/profiling
../../configure --disable-shared 'CXXFLAGS=-g -p -O2 -Wall -Wextra -Wpedantic'
# Build tesseract and training tools. Run `make` if you don't need the training tools.
make training
cd ../..

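This does not use a shared Tesseract library (which makes it possible to run tesseract without installing it), enables profiling code, enables compiler optimizations, and enables many compiler warnings.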

Optionally, profiling can be combined with debug code by adding --enable-debug and replacing -O2 with -O0.

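When Tesseract terminates, the profiling code creates a file named gmon.out in the current directory. GNU gprof is used to display the profiling information from that file.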
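Displaying the collected data with GNU gprof might look like the sketch below. The function name is hypothetical, and the path to the profiled executable is an assumption, since its location inside bin/profiling depends on the Tesseract version.

```shell
# Write a flat profile for executable $1 from profile data $2
# (default: gmon.out in the current directory) to standard output.
# (profile_report is an illustrative helper name.)
profile_report() {
  gmon=${2:-gmon.out}
  if [ ! -f "$gmon" ]; then
    echo "no profile data at $gmon; run the profiled binary first" >&2
    return 1
  fi
  gprof --flat-profile "$1" "$gmon"
}
```

A call such as `profile_report bin/profiling/tesseract | head -n 30` would list the functions that consumed the most time, assuming the executable sits at that path.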

Release Builds for Mass Production

The default build creates a Tesseract executable that is well suited to processing single images. In that case Tesseract uses 4 CPU cores to get an OCR result as fast as possible.

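For mass production with hundreds or thousands of images, this default works poorly because multithreaded execution has a very large overhead. It is better to run single-threaded instances of Tesseract so that each available CPU core processes a different image.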

This is a proven build sequence:

cd tesseract
./autogen.sh
mkdir -p bin/release
cd bin/release
../../configure --disable-openmp --disable-shared 'CXXFLAGS=-g -O2 -fno-math-errno -Wall -Wextra -Wpedantic'
# Build tesseract and training tools. Run `make` if you don't need the training tools.
make training
cd ../..

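This disables OpenMP (multithreading), does not use a shared Tesseract library (which makes it possible to run tesseract without installing it), enables compiler optimizations, disables setting errno for mathematical functions (faster execution!), and enables many compiler warnings.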
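With such a single-threaded build, one way to keep every core busy is to start one process per image. The following is a sketch, not a tuned production setup. It assumes PNG input, English text, and an xargs that supports `-0` and `-P`; the function name and the `OCR_BIN` variable are made up for illustration.

```shell
# Run one OCR process per image, $2 processes at a time.
# $1: directory with *.png images, $2: number of parallel jobs.
# OCR_BIN names the executable to run (default: tesseract on PATH).
# Each result is written next to its image as <image>.txt.
ocr_batch() {
  find "$1" -name '*.png' -print0 |
    xargs -0 -P "$2" -I{} "${OCR_BIN:-tesseract}" {} {} -l eng
}
```

For example, `ocr_batch scans "$(nproc)"` would process the images in a directory named scans with one job per CPU core, where `nproc` is the GNU coreutils command that prints the core count.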

Builds for Fuzzing

Fuzzing is used to test the Tesseract API for bugs. Tesseract uses OSS-Fuzz, but fuzzing can also be run locally. A recent Clang++ compiler is required.

Build example (adjust the value of CXX to match the available clang++):

cd tesseract
./autogen.sh
mkdir -p bin/fuzzer
cd bin/fuzzer
../../configure --disable-openmp --disable-shared CXX=clang++-7 CXXFLAGS='-g -O2 -Wall -Wextra -Wpedantic -D_GLIBCXX_DEBUG -fsanitize=fuzzer-no-link,address,undefined'
# Build the fuzzer executable.
make fuzzer-api
cd ../..

Example (show help information):

bin/fuzzer/fuzzer-api -help=1

Example (run the fuzzer with a known test case):

bin/fuzzer/fuzzer-api clusterfuzz-testcase-minimized-fuzzer-api-5670045835853824

Example (run the fuzzer to find new bugs):

nice bin/fuzzer/fuzzer-api -jobs=16 -workers=16

Building with Windows Visual Studio

See Windows compilation.