跳到内容。

命令行使用

Tesseract 'man' 页面

查看 man 页面以获取命令行语法和其他详细信息。

常见问题解答

查看 FAQ 以获取更多示例和提示。


Tesseract 5 中可用的 OCR 引擎

使用 --oem 1 用于 LSTM/神经网络,--oem 0 用于传统 Tesseract。

请注意,传统 Tesseract 模型仅包含在来自 tessdata 存储库的训练数据文件中。

tesseract input.tiff output --oem 1 -l eng


最简单的 OCR 图像调用

tesseract imagename outputbase

这使用 **英语** 作为默认语言,并将 3 作为页面分割模式。默认输出格式为 **文本**。

osd.traineddata(用于方向和分割)以及 eng.traineddata 和其他用于英语的语言数据文件应位于“tessdata”目录中。TESSDATA_PREFIX 环境变量应设置为“tessdata”目录的父目录。

如果 eng.traineddata 和 osd.traineddata 文件位于 /usr/share/tessdata 目录中,则以下命令将产生与上面相同的結果。

tesseract --tessdata-dir /usr/share imagename outputbase -l eng --psm 3

以下示例使用此图像,其中包含多种语言的文本。

eurotext.png

使用一种语言

在命令中添加 ' -l LANG',其中 LANG 是支持语言列表中的三个字母语言代码。如果没有指定,则默认情况下假定为英语。

tesseract images/eurotext.png - -l eng

输出

The (quick) [brown] {fox} jumps!
Over the $43,456.78 <lazy> #90 dog
& duck/goose, as 12.5% of E-mail
from [email protected] is spam.
Der ,schnelle” braune Fuchs springt
iiber den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. El zorro
marrén rapido salta sobre el perro
perezoso. A raposa marrom ripida
salta sobre o cdo preguigoso.

使用多种语言

在命令行中添加 -l LANG[+LANG] 以将多种语言一起用于识别

tesseract images/eurotext.png - -l eng+deu

输出

The (quick) [brown] {fox} jumps!
Over the $43,456.78 <lazy> #90 dog
& duck/goose, as 12.5% of E-mail
from [email protected] is spam.
Der „schnelle” braune Fuchs springt
über den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. El zorro
marrén rapido salta sobre el perro
perezoso. A raposa marrom räpida
salta sobre o cdo preguigoso.

多种语言的顺序

OCR 所花费的时间以及输出可能根据语言顺序而有所不同。

以下示例使用此图像,其中包含多种语言的文本 - 印地语和英语。

bilingual.png

使用英语作为主要语言,然后是印地语

time tesseract images/bilingual.png - -l eng+hin

Estimating resolution as 638
हिंदी से अंग्रेजी
HINDI TO
ENGLISH

real    0m0.442s
user    0m0.622s
sys     0m0.062s

使用印地语作为主要语言,然后是英语

time tesseract images/bilingual.png - -l hin+eng

Estimating resolution as 638
हिंदी से अंग्रेजी
HINDI TO
ENGLISH

real    0m0.429s
user    0m0.550s
sys     0m0.074s

使用脚本/天城体作为主要语言(它支持所有天城体语言和英语)

time tesseract images/bilingual.png - -l script/Devanagari

Estimating resolution as 638
हिंदी से अंग्रेजी
HINDI TO
ENGLISH

real    0m0.391s
user    0m0.459s
sys     0m0.093s

使用 quiet 配置以抑制消息

在上述命令的末尾使用 quiet 将抑制有关图像分辨率的消息。

time tesseract images/bilingual.png - -l script/Devanagari quiet

हिंदी से अंग्रेजी
HINDI TO
ENGLISH

real    0m0.416s
user    0m0.494s
sys     0m0.091s

可搜索的 PDF 输出

tesseract testing/eurotext.png testing/eurotext-eng -l eng pdf

这将创建一个包含图像和具有识别文本的单独可搜索文本层的 PDF。


HOCR 输出

通过在命令末尾添加 hocr 来使用 'hocr' 配置文件以获得 HOCR 输出。

tesseract images/eurotext.png - -l eng hocr

部分输出

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 <head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
  <meta name='ocr-system' content='tesseract 5.0.1-64-g3c22' />
  <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
 </head>
 <body>
  <div class='ocr_page' id='page_1' title='image "images/eurotext.png"; bbox 0 0 640 500; ppageno 0; scan_res 300 300'>
   <div class='ocr_carea' id='block_1_1' title="bbox 61 41 574 413">
    <p class='ocr_par' id='par_1_1' lang='eng' title="bbox 61 41 574 413">
     <span class='ocr_line' id='line_1_1' title="bbox 65 41 515 71; baseline 0.013 -11; x_size 25; x_descenders 5; x_ascenders 6">
      <span class='ocrx_word' id='word_1_1' title='bbox 65 41 111 61; x_wconf 96'>The</span>
      <span class='ocrx_word' id='word_1_2' title='bbox 128 42 217 66; x_wconf 95'>(quick)</span>
      <span class='ocrx_word' id='word_1_3' title='bbox 235 43 330 68; x_wconf 95'>[brown]</span>
      <span class='ocrx_word' id='word_1_4' title='bbox 349 44 415 69; x_wconf 94'>{fox}</span>
      <span class='ocrx_word' id='word_1_5' title='bbox 429 45 515 71; x_wconf 96'>jumps!</span>
     </span>

...

     <span class='ocr_line' id='line_1_12' title="bbox 61 385 444 413; baseline 0.013 -9; x_size 24; x_descenders 4; x_ascenders 5">
      <span class='ocrx_word' id='word_1_62' title='bbox 61 385 119 405; x_wconf 92'>salta</span>
      <span class='ocrx_word' id='word_1_63' title='bbox 135 385 200 406; x_wconf 92'>sobre</span>
      <span class='ocrx_word' id='word_1_64' title='bbox 216 392 229 406; x_wconf 83'>o</span>
      <span class='ocrx_word' id='word_1_65' title='bbox 244 388 285 407; x_wconf 80'>cdo</span>
      <span class='ocrx_word' id='word_1_66' title='bbox 300 388 444 413; x_wconf 92'>preguigoso.</span>
     </span>
    </p>
   </div>
  </div>
 </body>
</html>

TSV 输出

通过在命令末尾添加 tsv 来使用 'tsv' 配置文件以获得 TSV 输出。

tesseract images/eurotext.png - -l eng tsv

部分输出

level   page_num        block_num       par_num line_num        word_num        left    top     width   height  conf    text
1       1       0       0       0       0       0       0       640     500     -1
2       1       1       0       0       0       61      41      513     372     -1
3       1       1       1       0       0       61      41      513     372     -1
4       1       1       1       1       0       65      41      450     30      -1
5       1       1       1       1       1       65      41      46      20      96.063751       The
5       1       1       1       1       2       128     42      89      24      95.965691       (quick)
5       1       1       1       1       3       235     43      95      25      95.835831       [brown]
5       1       1       1       1       4       349     44      66      25      94.899742       {fox}
5       1       1       1       1       5       429     45      86      26      96.683357       jumps!
4       1       1       1       2       0       65      72      490     31      -1
5       1       1       1       2       1       65      72      60      20      96.912064       Over
5       1       1       1       2       2       140     73      37      20      96.887390       the
5       1       1       1       2       3       194     73      139     24      93.263031       $43,456.78
5       1       1       1       2       4       350     76      85      25      90.893219       <lazy>
5       1       1       1       2       5       451     77      44      19      96.820717       #90
5       1       1       1       2       6       511     78      44      25      96.538940       dog
4       1       1       1       3       0       64      103     458     26      -1

使用不同的页面分割模式

–psm 3 - 全自动页面分割,但没有 OSD。(默认)

以下示例使用此图像,其中包含多列文本。

2col.png

tesseract images/2col.png - --psm 3
Cautionary Statement

ON FORWARD-LOOKING STATEMENTS: This presentation includes
information, statements, beliefs and opinions which are forward-looking, and
which reflect current estimates, expectations and projections about future
events, referred to herein as “forward-looking statements” within the meaning
of the U.S> Private Securities Litigation Reform Act of 1995 or “forward-looking
information” under applicable securities laws. Statements containing the words
“believe”, “expect”, “continue”, “could, “potential”, “predict”, “would”,

“intend”, “should”, “seek”, “anticipate”, ‘will’, “opportunity,” “positioned”,

“poised,” “project”, “risk”, “plan”, “may”, “estimate” or, in each case, their

Historical Information: Historical statements contained in this document
regarding past trends or activities should not be taken as a representation that
such trends or activities will continue in the future. In this regard, certain
financial information contained herein has been extracted from, or based upon,
information available in the public domain and/or provided by the Company. In
particular, historical results should not be taken as a representation that such
trends will be replicated in the future. No statement in this document is
intended to be nor may be construed as a profit forecast.

–psm 6 - 假设单个统一文本块。

以下示例使用此图像,其中包含一个目录。

toc.png

tesseract images/toc.png - --psm 6
Contents

Introduction to the Tenth Anniversary Edition page xvii
Afterword to the Tenth Anniversary Edition xix
Preface xxi
Acknowledgements xxvii
Nomenclature and notation xxix
Part I Fundamental concepts 1
1 Introduction and overview 1
1.1 Global perspectives 1

1.11 History of quantum computation and quantum
information 2
1.1.2 Future directions 12
1.2 Quantum bits 13
1.2.1 Multiple qubits 16
1.3 Quantum computation 17
1.3.1 Single qubit gates 17
1.3.2 Multiple qubit gates 20
1.3.3 Measurements in bases other than the computational basis 2
1.34 Quantum circuits 2
1.3.5 Qubit copying circuit? 24

使用 -c preserve_interword_spaces=1 来保留空格

tesseract images/toc.png - --psm 6 -c preserve_interword_spaces=1
Contents

Introduction to the Tenth Anniversary Edition                                       page xvii
Afterword to the Tenth Anniversary Edition                                                 xix
Preface                                                                                                         xxi
Acknowledgements                                                                                    xxvii
Nomenclature and notation                                                                        xxix
Part I Fundamental concepts                                                                          1
1 Introduction and overview                                                                              1
1.1 Global perspectives                                                                                 1

1.11 History of quantum computation and quantum
information                                                                                 2
1.1.2 Future directions                                                12
1.2 Quantum bits                                                          13
1.2.1 Multiple qubits                                                     16
1.3 Quantum computation                                                  17
1.3.1 Single qubit gates                                                17
1.3.2 Multiple qubit gates                                                20
1.3.3 Measurements in bases other than the computational basis              2
1.34 Quantum circuits                                                            2
1.3.5 Qubit copying circuit?                                                  24

使用 pdftotext 来保留文本输出的布局

tesseract images/toc.png images/toc -l eng –psm 11 pdf

pdftotext -layout images/toc.pdf -

                                    Contents




Introduction to the Tenth Anniversary Edition                             page xvii
Afterword to the Tenth Anniversary Edition                                      xix
Preface                                                                         xxi
Acknowledgements                                                              xx
Nomenclature and notation                                                      xxix

Part I Fundamental concepts

 1 Introduction and overview
    1.1 Global perspectives
          1.11 History of quantum computation and quantum
                information
          1.1.2 Future directions                                                12
    12 Quantum bits                                                              13
         1.2.1 Multiple qubits                                                   16
    1.3 Quantum computation                                                      17
               Single qubit gates
             2 Multiple qubit gates                                              20
               Measurements in bases other than the computational basis          2
             4 Quantum circu                                                     2
             5 Qubit copying circuit?                                            24