This is ocrad.info, produced by makeinfo version 4.13+ from ocrad.texi.

INFO-DIR-SECTION GNU Packages
START-INFO-DIR-ENTRY
* Ocrad: (ocrad).               The GNU OCR program
END-INFO-DIR-ENTRY


File: ocrad.info,  Node: Top,  Next: Introduction,  Up: (dir)

GNU Ocrad Manual
****************

This manual is for GNU Ocrad (version 0.25, 31 March 2015).

* Menu:

* Introduction::                Purpose and features of GNU Ocrad
* Character sets::              Input charsets and output formats
* Invoking ocrad::              Command line interface
* Filters::                     Postprocessing the produced text
* Library version::             Checking library version
* Library functions::           Descriptions of the library functions
* Library error codes::         Meaning of codes returned by functions
* Image format conversion::     How to convert other formats to pnm
* Algorithm::                   How ocrad does its job
* OCR results file::            Description of the ORF file format
* Problems::                    Reporting bugs
* Concept index::               Index of concepts


   Copyright (C) 2003-2015 Antonio Diaz Diaz.

   This manual is free documentation: you have unlimited permission to
copy, distribute and modify it.


File: ocrad.info,  Node: Introduction,  Next: Character sets,  Prev: Top,  Up: Top

1 Introduction
**************

GNU Ocrad is an OCR (Optical Character Recognition) program and library
based on a feature extraction method. It reads images in pbm (bitmap),
pgm (greyscale) or ppm (color) formats and produces text in
byte (8-bit) or UTF-8 formats. The pbm, pgm and ppm formats are
collectively known as pnm.

   Ocrad includes a layout analyser able to separate the columns or
blocks of text normally found on printed pages.

   For best results the characters should be at least 20 pixels high. If
they are smaller, try the '--scale' option. Scanning the image at 300
dpi usually produces a character size good enough for ocrad.


File: ocrad.info,  Node: Character sets,  Next: Invoking ocrad,  Prev: Introduction,  Up: Top

2 Character sets
****************

The character set internally used by ocrad is ISO 10646, also known as
UCS (Universal Character Set), which can represent over two thousand
million characters (2^31).

   As it is unpractical to try to recognize one among so many different
characters, you can tell ocrad what character sets to recognize. You do
this with the '--charset' option.

   If the input page contains characters from only one character set,
say 'ISO-8859-15', you can use the default 'byte' output format. But in
a page with 'ISO-8859-9' and 'ISO-8859-15' characters, you can't tell
if a code of 0xFD represents a 'latin small letter i dotless' or a
'latin small letter y with acute'. You should use '--format=utf8'
instead.
Of course, you may request UTF-8 output in any case.


   NOTE: 10^9 is a thousand millions, a billion is a million millions
(million^2), a trillion is a million million millions (million^3), and
so on. Please, don't "embrace and extend" the meaning of prefixes,
making communication among all people difficult. Thanks.


File: ocrad.info,  Node: Invoking ocrad,  Next: Filters,  Prev: Character sets,  Up: Top

3 Invoking ocrad
****************

The format for running ocrad is:

     ocrad [OPTIONS] [FILES]

   Ocrad supports the following options:

'-h'
'--help'
     Print an informative help message describing the options and exit.
     'ocrad --verbose --help' describes also hidden options.

'-V'
'--version'
     Print the version number of ocrad on the standard output and exit.

'-a'
'--append'
     Append generated text to the output file instead of overwriting it.

'-c NAME'
'--charset=NAME'
     Enable recognition of the characters belonging to the given
     character set.  You can repeat this option multiple times with
     different names for processing a page with characters from
     different character sets.
     If no charset is specified, 'iso-8859-15' (latin9) is assumed.
     Try '--charset=help' for a list of valid charset names.

'-e NAME'
'--filter=NAME'
     Pass the output text through the given built-in postprocessing
     filter (*note Filters::). Several filters can be applied in
     sequence using as many '--filter' and '--user-filter' options as
     needed. The filters are applied in the order they appear on the
     command line.
     Try '--filter=help' for a list of valid filter names.

'-E FILE'
'--user-filter=FILE'
     Pass the output text through the postprocessing filter defined in
     FILE. See the chapter 'Filters' (*note Filters::) for a
     description of the format of FILE. Several filters can be applied
     in sequence using as many '--filter' and '--user-filter' options
     as needed. The filters are applied in the order they appear on the
     command line.

'-f'
'--force'
     Force overwrite of output files.

'-F NAME'
'--format=NAME'
     Select the output format. The valid names are 'byte' and 'utf8'.
     If no output format is specified, 'byte' (8 bit) is assumed.

'-i'
'--invert'
     Invert image levels (white on black).

'-l'
'--layout'
     Enable page layout analysis. Ocrad is able to separate blocks of
     text of arbitrary shape as long as they are clearly delimited by
     white space.

'-o FILE'
'--output=FILE'
     Place the output into FILE instead of into the standard output.

'-q'
'--quiet'
     Quiet operation.

'-s VALUE'
'--scale=VALUE'
     Scale up the input image by VALUE before layout analysis and
     recognition. If VALUE is negative, the input image is scaled down
     by -VALUE.

'-t NAME'
'--transform=NAME'
     Perform given transformation (rotation or mirroring) on the input
     image before scaling, layout analysis and recognition. Rotations
     are made counter-clockwise.
     Try '--transform=help' for a list of valid transformation names.

'-T VALUE'
'--threshold=VALUE'
     Set binarization threshold for pgm or ppm files or for '--scale'
     option (only for scaled down images). VALUE should be a rational
     number between 0 and 1, and may be given as a percentage (50%), a
     fraction (1/2), or a decimal value (0.5). Image values greater than
     threshold are converted to white. The default value is 0.5.

'-u LEFT,TOP,WIDTH,HEIGHT'
'--cut=LEFT,TOP,WIDTH,HEIGHT'
     Cut the input image by the rectangle defined by LEFT, TOP, WIDTH
     and HEIGHT. Values may be relative to the image size
     (-1.0 <= value <= +1.0), or absolute (abs( value ) > 1).  Negative
     values of LEFT, TOP are relative to the right-bottom corner of the
     image. Values of WIDTH and HEIGHT must be positive. Absolute and
     relative values can be mixed. For example
     'ocrad --cut 700,960,1,1' will extract from '700,960' to the
     right-bottom corner of the image.
     The cutting is performed before any other transformation (rotation
     or mirroring) on the input image, and before scaling, layout
     analysis and recognition.

'-v'
'--verbose'
     Verbose mode.

'-x FILE'
'--export=FILE'
     Write (export) OCR results file to FILE (*note OCR results file::).
     '-x -' writes to stdout, overriding text output except if output
     has been also redirected with the '-o' option.


   Exit status: 0 for a normal exit, 1 for environmental problems (file
not found, invalid flags, I/O errors, etc), 2 to indicate a corrupt or
invalid input file, 3 for an internal consistency error (eg, bug) which
caused ocrad to panic.


File: ocrad.info,  Node: Filters,  Next: Library version,  Prev: Invoking ocrad,  Up: Top

4 Postprocessing the produced text
**********************************

Filters replace some characters in the text output with different
characters and remove some other characters from the output. For
example, when recognizing a text that is known to contain just numbers,
any character recognized as a 'Z' will probably be a '2'.

   Filters don't enable the recognition of characters, just filter them
from the output. Use '--charset' to enable the recognition of a
character set different from the default ISO-8859-15.

   Ocrad provides both built-in filters and user-defined filters.

4.1 User-defined filters
========================

The format of a user-defined filter file (*note --user-filter::) is very
simple. Each line contains either a character conversion or a word that
specifies the default behaviour for unlisted characters.

   A character conversion is a comma-separated list of quoted characters
('c'), character sets ([0-9A-Z]), character codes (U0063), or character
ranges (U0000 - UFFFF), and an optional conversion (an equal sign (=)
followed by a quoted character or a character code). The characters in
the list are converted to the character in the conversion. If no
conversion is specified, the character is left unmodified (converted to
itself).

   The default behaviour is to discard unlisted characters, i.e. those
characters not appearing in the file, either by themselves or included
in a set or range. If a line containing just the word 'leave' is found
in the file, unlisted characters are left unmodified. If the word is
'mark', unlisted characters are marked as unrecognized.

   The destination character of a conversion is considered as listed by
default. Every character may be listed more than once, even as part of
different conversions. The last conversion affecting a given character
is the one that is performed.

   Character sets and quoted characters may contain escape sequences.

   The character '#' at begin of line or after whitespace starts a
comment that extends to the end of the line.

   Ranges of characters may be specified in character sets by writing
the starting and ending characters with a '-' between them. Thus,
'[A-Z]' matches any ASCII uppercase letter. '-' may be specified by
placing it first or last. ']' may be specified by placing it first. If
the first character after the left bracket is '^', it indicates a
"complemented set", which matches any character except the ones between
the brackets.

   Literals (quoted characters and character sets) are decoded as
ISO-8859-15. Character codes are decoded as UCS2. Thus, a 'latin
capital letter y with diaeresis' is specified in a set as '[\xBE]', but
its code is 'U0178'.

   Spaces and control characters are unaffected by filters, except that
leadind, trailing, and duplicate spaces produced by the removal of other
characters will be themselves removed.

Here is an example user-defined filter file equivalent to the built-in
filter 'numbers':

     leave                      # remove this line to get 'numbers_only'
     'D', 'O', 'Q', 'o' = '0'
     'I', 'L', 'l', '|' = '1'
     'Z', 'z'           = '2'
                          '3'
     'A', 'q'           = '4'
     'S', 's'           = '5'
     'G', 'b', U00F3    = '6'   # latin small letter o with acute
     'J', 'T'           = '7'
     '&', 'B'           = '8'
     'g'                = '9'

4.2 Built-in filters
====================

Ocrad provides the following built-in filters (*note --filter::):

'--filter=letters'
     Forces every character that resembles a letter to be recognized as
     a letter. Other characters will be output without change.

'--filter=letters_only'
     Same as '--filter=letters', but other characters will be discarded.

'--filter=numbers'
     Forces every character that resembles a number to be recognized as
     a number. Other characters will be output without change.

'--filter=numbers_only'
     Same as '--filter=numbers' but other characters will be discarded.

'--filter=same_height'
     Discards any character (or noise) whose height differs in more
     than 10 percent from the median height of the characters in the
     line.

'--filter=text_block'
     Discards any character (or noise) outside of a rectangular block
     of text lines.

'--filter=upper_num'
     Forces every character that resembles a uppercase letter or a
     number to be recognized as such. Other characters will be output
     without change.

'--filter=upper_num_mark'
     Same as '--filter=upper_num', but other characters will be marked
     as unrecognized.

'--filter=upper_num_only'
     Same as '--filter=upper_num', but other characters will be
     discarded.



File: ocrad.info,  Node: Library version,  Next: Library functions,  Prev: Filters,  Up: Top

5 Library version
*****************

 -- Function: const char * OCRAD_version ( void )
     Returns the library version as a string.

 -- Constant: const char * OCRAD_version_string
     This constant is defined in the header file 'ocradlib.h'.

   The application should compare OCRAD_version and OCRAD_version_string
for consistency. If the first character differs, the library code
actually used may be incompatible with the 'ocradlib.h' header file
used by the application.

     if( OCRAD_version()[0] != OCRAD_version_string[0] )
       error( "bad library version" );


File: ocrad.info,  Node: Library functions,  Next: Library error codes,  Prev: Library version,  Up: Top

6 Library functions
*******************

These are the OCRAD library functions. In case of error, all of them
return -1 or a null pointer, except 'OCRAD_open' whose return value
must be verified by calling 'OCRAD_get_errno' before using it.

 -- Function: struct OCRAD_Descriptor * OCRAD_open ( void )
     Initializes the internal library state and returns a pointer that
     can only be used as the OCRDES argument for the other OCRAD
     functions, or a null pointer if the descriptor could not be
     allocated.

     The returned pointer must be verified by calling 'OCRAD_get_errno'
     before using it. If 'OCRAD_get_errno' does not return 'OCRAD_ok',
     the returned pointer must not be used and should be freed with
     'OCRAD_close' to avoid memory leaks.

 -- Function: int OCRAD_close ( struct OCRAD_Descriptor * const OCRDES )
     Frees all dynamically allocated data structures for this
     descriptor.  After a call to 'OCRAD_close', OCRDES can no more be
     used as an argument to any OCRAD function.

 -- Function: enum OCRAD_Errno OCRAD_get_errno ( struct
          OCRAD_Descriptor * const OCRDES )
     Returns the current error code for OCRDES (*note Library error
     codes::).

 -- Function: int OCRAD_set_image ( struct OCRAD_Descriptor * const
          OCRDES, const struct OCRAD_Pixmap * const IMAGE, const bool
          INVERT )
     Loads IMAGE into the internal buffer. If INVERT is true, image
     levels are inverted (white on black). Loading a new image deletes
     any previous text results.

 -- Function: int OCRAD_set_image_from_file ( struct OCRAD_Descriptor *
          const OCRDES, const char * const FILENAME, const bool INVERT )
     Loads a image from the file FILENAME into the internal buffer. If
     INVERT is true, image levels are inverted (white on black).
     Loading a new image deletes any previous text results.

 -- Function: int OCRAD_set_utf8_format ( struct OCRAD_Descriptor *
          const OCRDES, const bool UTF8 )
     Set the output format to 'byte' (if UTF8=false) or to 'utf8'. By
     default ocrad produces 'byte' (8 bit) output.

 -- Function: int OCRAD_set_threshold ( struct OCRAD_Descriptor * const
          OCRDES, const int THRESHOLD )
     Set binarization threshold for greymap or RGB images. THRESHOLD
     values between 0 and 255 set a fixed threshold. A value of -1 sets
     an automatic threshold. Pixel values greater than the resulting
     threshold are converted to white. The default threshold value if
     this function is not called is 127.

 -- Function: int OCRAD_scale ( struct OCRAD_Descriptor * const OCRDES,
          const int VALUE )
     Scale up the image in the internal buffer by VALUE. If VALUE is
     negative, the image is scaled down by -VALUE.

 -- Function: int OCRAD_recognize ( struct OCRAD_Descriptor * const
          OCRDES, const bool LAYOUT )
     Recognize the image loaded in the internal buffer and produce text
     results which can be later retrieved with the 'OCRAD_result'
     functions. The same image can be recognized as many times as
     desired, for example setting a new threshold each time for 3D
     greymap recognition. Every time this function is called, the
     produced text results replace any previous ones. If LAYOUT is
     true, page layout analysis is enabled, probably producing more
     than one text block.

 -- Function: int OCRAD_result_blocks ( struct OCRAD_Descriptor * const
          OCRDES )
     Returns the number of text blocks found in the image, or 0 if no
     text was found. The returned value is usually 1, but can be larger
     if layout analysis was requested.

 -- Function: int OCRAD_result_lines ( struct OCRAD_Descriptor * const
          OCRDES, const int BLOCKNUM )
     Returns the number of text lines contained in the given text block.

 -- Function: int OCRAD_result_chars_total ( struct OCRAD_Descriptor *
          const OCRDES )
     Returns the total number of text characters contained in the
     recognized image.

 -- Function: int OCRAD_result_chars_block ( struct OCRAD_Descriptor *
          const OCRDES, const int BLOCKNUM )
     Returns the number of text characters contained in the given text
     block.

 -- Function: int OCRAD_result_chars_line ( struct OCRAD_Descriptor *
          const OCRDES, const int BLOCKNUM, const int LINENUM )
     Returns the number of text characters contained in the given text
     line.

 -- Function: const char * OCRAD_result_line ( struct OCRAD_Descriptor
          * const OCRDES, const int BLOCKNUM, const int LINENUM )
     Returns the line of text specified by BLOCKNUM and LINENUM.

 -- Function: int OCRAD_result_first_character ( struct
          OCRAD_Descriptor * const OCRDES )
     Returns the byte result for the first character in the image.
     Returns 0 if the image has no characters or if the first character
     could not be recognized. This function is a convenient short cut
     to the result for images containing a single character.


File: ocrad.info,  Node: Library error codes,  Next: Image format conversion,  Prev: Library functions,  Up: Top

7 Library error codes
*********************

Most library functions return -1 or a null pointer to indicate that they
have failed. But this return value only tells you that an error has
occurred. To find out what kind of error it was, you need to verify the
error code by calling 'OCRAD_get_errno'.

   Library functions do not change the value returned by
'OCRAD_get_errno' when they succeed; thus, the value returned by
'OCRAD_get_errno' after a successful call is not necessarily OCRAD_ok,
and you should not use 'OCRAD_get_errno' to determine whether a call
failed. If the call failed, then you can examine 'OCRAD_get_errno'.

   The error codes are defined in the header file 'ocradlib.h'.

 -- Constant: enum OCRAD_Errno OCRAD_ok
     The value of this constant is 0 and is used to indicate that there
     is no error.

 -- Constant: enum OCRAD_Errno OCRAD_bad_argument
     At least one of the arguments passed to the library function was
     invalid.

 -- Constant: enum OCRAD_Errno OCRAD_mem_error
     No memory available. The system cannot allocate more virtual memory
     because its capacity is full.

 -- Constant: enum OCRAD_Errno OCRAD_sequence_error
     A library function was called in the wrong order. For example
     'OCRAD_result_line' was called before 'OCRAD_recognize'.

 -- Constant: enum OCRAD_Errno OCRAD_library_error
     A bug was detected in the library. Please, report it (*note
     Problems::).


File: ocrad.info,  Node: Image format conversion,  Next: Algorithm,  Prev: Library error codes,  Up: Top

8 Image format conversion
*************************

There are a lot of image formats, but ocrad is able to decode only three
of them; pbm, pgm and ppm. In this chapter you will find command
examples and advice about how to convert image files to a format that
ocrad can manage.

'.png'
     Portable Network Graphics file. Use the command
     'pngtopnm filename.png | ocrad'.
     In some cases, like the ocrad.png icon, you have to invert the
     image with the '-i' option: 'pngtopnm filename.png | ocrad -i'.

'.ps'
'.pdf'
     Postscript or Portable Document Format file. Use the command
     'gs -sPAPERSIZE=a4 -sDEVICE=pnmraw -r300 -dNOPAUSE -dBATCH -sOutputFile=- -q filename.ps | ocrad'.
     You may also use the command
     'pstopnm -stdout -dpi=300 -pgm filename.ps | ocrad',
     but it seems not to work with pdf files. Also old versions of
     'pstopnm' don't recognize the '-dpi' option and produce an image
     too small for OCR.

'.tiff'
     TIFF file. Use the command
     'tifftopnm filename.tiff | ocrad'.

'.jpg'
     JPEG file. Use the command
     'djpeg -greyscale -pnm filename.jpg | ocrad'.
     JPEG is a lossy format and is in general not recommended for text
     images.

'.pnm.gz'
     Pnm file compressed with gzip. Use the command
     'gzip -cd filename.pnm.gz | ocrad'

'.pnm.lz'
     Pnm file compressed with lzip. Use the command
     'lzip -cd filename.pnm.lz | ocrad'



File: ocrad.info,  Node: Algorithm,  Next: OCR results file,  Prev: Image format conversion,  Up: Top

9 Algorithm
***********

Ocrad is mainly a research project. Many of the algorithms ocrad uses
are ad hoc, and will change in successive releases as I myself gain
understanding about OCR issues.

   The overall working of ocrad may be described as follows:
1) Read the image.
2) Optionally, perform some transformations (cut, rotate, scale, etc).
3) Optionally, perform layout detection.
4) Remove frames and pictures.
5) Detect characters and group them in lines.
6) Recognize characters (very ad hoc; one algorithm per character).
7) Correct some ambiguities (transform l.OOO into 1.000, etc).
8) Optionally, apply one or more filters to the text.
9) Output text result.


   Ocrad recognizes characters by its shape, and the reason it is so
fast is that it does not compare the shape of every character against
some sort of database of shapes and then chooses the best match.
Instead of this, ocrad only compares the shape differences that are
relevant to choose between two character categories, mostly like a
binary search.

   As there is no such thing as a free lunch, this approach has some
drawbacks. It makes ocrad very sensitive to character defects, and makes
difficult to modify ocrad to recognize new characters.


File: ocrad.info,  Node: OCR results file,  Next: Problems,  Prev: Algorithm,  Up: Top

10 OCR results file
*******************

Calling ocrad with option '-x' produces an OCR results file (ORF), that
is, a parsable file containing the OCR results. The ORF format is as
follows:

   - Any line beginning with '#' is a comment line.

   - The first non-comment line has the form 'source file FILENAME',
     where FILENAME is the name of the file being processed ('-' for
     stdin). This is the only line guaranteed to exist for every input
     file read without errors. If the file, or any block or line, has
     no text, the corresponding part in the ORF file will be missing.

   - The second non-comment line has the form 'total text blocks N',
     where N is the total number of text blocks in the source image.

For each text block in the source image, the following data follows:

   - A line in the form 'text block I X Y W H'. Where I is the block
     number and X Y W H are the block position and size as described
     below for character boxes.

   - A line in the form 'lines N'. Where N is the number of lines in
     this block.

For each line in every text block, the following data follows:

   - A line in the form 'line I chars N height H', where I is the line
     number, N is the number of characters in this line, and H is the
     mean height of the characters in this line (in pixels).

   - N lines (one for every character) in the form
     'X Y W H; G[, 'C'V]...', where:
     X is the left border (x-coordinate) of the char bounding box in the
     source image (in pixels).
     Y is the top border (y-coordinate).
     W is the width of the bounding box.
     H is the height of the bounding box.
     G is the number of different recognition guesses for this
     character.
     The result characters follow after the number of guesses in the
     form of a comma-separated list of pairs. Every pair is formed by
     the actual recognised char C enclosed in single quotes, followed
     by the confidence value V, without space between them. The higher
     the value of confidence, the more confident is the result.

   Running './ocrad -x test.orf testsuite/test.pbm' in the source
directory will give you an example ORF file.


File: ocrad.info,  Node: Problems,  Next: Concept index,  Prev: OCR results file,  Up: Top

11 Reporting bugs
*****************

There are probably bugs in ocrad. There are certainly errors and
omissions in this manual. If you report them, they will get fixed. If
you don't, no one will ever know about them and they will remain unfixed
for all eternity, if not longer.

   If you find a bug in GNU Ocrad, please send electronic mail to
<bug-ocrad@gnu.org>. Include the version number, which you can find by
running 'ocrad --version'.


File: ocrad.info,  Node: Concept index,  Prev: Problems,  Up: Top

Concept index
*************

 [index ]
* Menu:

* algorithm:                             Algorithm.             (line 6)
* bugs:                                  Problems.              (line 6)
* filters:                               Filters.               (line 6)
* getting help:                          Problems.              (line 6)
* image format conversion:               Image format conversion.
                                                                (line 6)
* input charsets:                        Character sets.        (line 6)
* introduction:                          Introduction.          (line 6)
* invoking:                              Invoking ocrad.        (line 6)
* library error codes:                   Library error codes.   (line 6)
* library functions:                     Library functions.     (line 6)
* library version:                       Library version.       (line 6)
* OCR results file:                      OCR results file.      (line 6)
* options:                               Invoking ocrad.        (line 6)
* output format:                         Character sets.        (line 6)
* usage:                                 Invoking ocrad.        (line 6)
* version:                               Invoking ocrad.        (line 6)



Tag Table:
Node: Top196
Node: Introduction1256
Node: Character sets1990
Node: Invoking ocrad3144
Ref: --filter4090
Ref: --user-filter4467
Node: Filters7505
Node: Library version12290
Node: Library functions12962
Node: Library error codes18080
Node: Image format conversion19631
Node: Algorithm21155
Node: OCR results file22488
Node: Problems24759
Node: Concept index25297

End Tag Table


Local Variables:
coding: iso-8859-15
End: