     OCR(1)                 (cetus,hydra,coma)                  OCR(1)

     NAME
          ocr - optical character recognition

     SYNOPSIS
          ocr [ option ... ] [ file ]

     DESCRIPTION
          Ocr reads a black-and-white image of a page from file, and
          writes ASCII to the standard output.  If no file is speci-
          fied, it reads from the standard input.

          The input is a picfile(5) image of one column of machine-
          printed text, normally scanned in by cscan(1). Fonts, sizes,
          and line-spacings may vary within the column, but each line
          should have a constant text size and baseline.  Lines should
          be parallel and roughly horizontal.

          In the output, white space approximates the original page
          layout.  Words that spell(1) are preferred, and hyphenations
          across lines are recombined.
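
          As an illustration only, recombining hyphenations might be
          sketched as below.  The function name and the heuristic (a
          line ending in letter-plus-hyphen, followed by a line that
          starts with a lower-case letter) are assumptions made for
          this example; the method ocr actually uses is not
          documented here.

```python
def recombine_hyphens(lines):
    """Join words split by an end-of-line hyphen (hypothetical heuristic).

    A word is treated as hyphenated across lines when a line ends with a
    letter followed by '-' and the next line begins with a lower-case letter.
    The second half of the word is moved up to the end of the first line.
    """
    lines = [ln.strip() for ln in lines]
    out = []
    pending = False  # True when the previous line ended in a split word
    for i, text in enumerate(lines):
        if pending:
            # Move the first word of this line up to complete the fragment.
            head, _, text = text.partition(" ")
            out[-1] += head
            pending = False
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        if (len(text) >= 2 and text.endswith("-")
                and text[-2].isalpha() and nxt[:1].islower()):
            text = text[:-1]  # drop the hyphen; the next line supplies the rest
            pending = True
        out.append(text)
    return out


page = ["If no file is speci-", "fied, it reads from the standard input."]
print(recombine_hyphens(page))
```

          A numeric range such as "6-" followed by "24" is left
          alone, because the character before the hyphen is not a
          letter.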

          The options are:

          -as    The alphabet is the union of symbol sets selected by
                 characters in string s, from among:

                 A  ABCDEFGHIJKLMNOPQRSTUVWXYZ
                 a  abcdefghijklmnopqrstuvwxyz
                 0  0123456789
                 .  .,-:;*'"?!/&$()[]#@%         (basic punctuation)
                 ^  ^~`\|{}_                     (extended punct'n)
                 +  +-*/<>=.Ee[]                 (numerical punct'n)
                 s  §†‡¢•© ...                   (selected non-ASCII)
                 l  fi fl ff ffi ffl ae oe ...   (ligatures, digraphs)
                 g  αβγδεζ ...                   (Greek lower case)
                 G  ΑΒΓΔΕΖ ...                   (Greek upper case)

                 The default is -aAa0.+^, the full printable-ASCII
                 set, which may be abbreviated as -ap.  Thus, -apslgG
                 selects all of the above.

          -c     Find columns in complex nested layouts using a greedy
                 white-covers algorithm.

          -ml[,r]
                 Trim the left and right margins of the image by l and
                 r inches, respectively, before looking for columns.
                 If r is omitted, it is assumed to equal l.

          -nn    Find the n largest columns by analysis of a single
                 vertical projection.  Each column should be
                 compactly-printed and separated from the others by at
                 least 2 ems of horizontal white space.

          -pn,m  Point sizes lie in the range [ n, m ]; other sizes
                 are discarded.  The default is -p6,24.

          -s     Disable the spelling check (but continue to favor
                 numeric strings and good punctuation).

          -t     Write troff(1) format.  Each column is shown on a
                 separate page, with lines at their original height,
                 words at their original horizontal location, and
                 characters at roughly their original size in Times
                 Roman.  Hyphenated words are not recombined.

          -u     Unspellable words are prefixed with `?' or, if -t is
                 specified, printed in boldface.

          -ww    Find the largest column of width w inches, within a
                 single vertical projection.
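
          The vertical projection that -n and -w refer to can be
          pictured with the sketch below: count the black pixels in
          each pixel column, then look for runs of inked columns
          separated by wide enough white gaps.  This is a simplified
          stand-in, not ocr's own code.  The names find_columns and
          min_gap are hypothetical, and min_gap (in pixel columns)
          only stands in for the 2-em separation, which the caller
          would have to convert to pixels.

```python
def vertical_projection(image):
    """Count black pixels in each pixel column of a binary image (1 = black)."""
    return [sum(col) for col in zip(*image)]


def find_columns(image, n=1, min_gap=2):
    """Return up to n (start, end) x-ranges, widest first; end is exclusive.

    A text column is taken to be a run of pixel columns containing ink.
    White gaps narrower than min_gap pixel columns do not split a run.
    """
    profile = vertical_projection(image)
    runs = []
    start = None  # x where the current run began, or None between runs
    gap = 0       # consecutive white pixel columns seen inside the run
    for x, count in enumerate(profile):
        if count > 0:
            if start is None:
                start = x
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # The last inked column was at x - gap.
                runs.append((start, x - gap + 1))
                start = None
                gap = 0
    if start is not None:
        # Close a run that reaches the right edge, minus trailing white.
        runs.append((start, len(profile) - gap))
    runs.sort(key=lambda r: r[1] - r[0], reverse=True)
    return runs[:n]


# Two rows of a toy 12-pixel-wide image: a wide column, a gap, a narrow one.
image = [
    [1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1],
    [1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0],
]
print(find_columns(image, n=2, min_gap=2))
```

          With min_gap=2 the single white pixel column at x=10 does
          not split the second text column; with min_gap=1 it does.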

        Fonts
          Ocr is trained on over 100 Latin-alphabet book fonts in
          various styles (italic, bold, etc.).  Only one Greek font,
          without diacriticals, is included.  Swedish and Tibetan
          are also available on request.

     SEE ALSO
          bcp(1), cscan(1), font(6), picfile(5), spell(1), troff(1)

     BUGS
          For best results, use images of high-contrast, cleanly-
          printed original documents digitized at a resolution of 400
          pixels/inch or higher.  It may help to restrict the alphabet
          and point sizes to those that actually appear on the page.
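
          To make the last suggestion concrete, the sketch below
          expands a -a selector string into the set of characters it
          admits, and converts a point size to a pixel height at a
          given scanning resolution.  It is an illustration under
          stated assumptions, not part of ocr: the function names are
          hypothetical, only the ASCII symbol sets are included
          (the non-ASCII sets are not fully listed above), and one
          point is assumed to be 1/72 inch.

```python
# ASCII symbol sets copied from the -a option table.  The non-ASCII
# selectors (s, l, g, G) are omitted because their contents are elided.
SYMBOL_SETS = {
    "A": "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
    "a": "abcdefghijklmnopqrstuvwxyz",
    "0": "0123456789",
    ".": ".,-:;*'\"?!/&$()[]#@%",
    "^": "^~`\\|{}_",
    "+": "+-*/<>=.Ee[]",
}
ABBREVIATIONS = {"p": "Aa0.+^"}  # -ap abbreviates the default -aAa0.+^


def expand_alphabet(selectors):
    """Return the union of the symbol sets named by a -a selector string.

    Raises KeyError for selectors this sketch does not model.
    """
    chars = set()
    for key in selectors:
        for name in ABBREVIATIONS.get(key, key):
            chars.update(SYMBOL_SETS[name])
    return chars


def points_to_pixels(points, dpi=400):
    """Convert a point size to pixels, assuming 1 point = 1/72 inch."""
    return points * dpi / 72


digits_only = expand_alphabet("0+")
print(sorted(digits_only))
print(points_to_pixels(6), points_to_pixels(24))
```

          For a page that contains only numbers set near 10 point,
          an invocation along the lines of "ocr -a0+ -p9,11 file"
          would apply both restrictions.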