DocumentOcr Component

Use the DocumentOcr component, available in Pega Robotic Automation in version 8.0 SP1 2006 and later, to convert documents that contain images into searchable text. The document can be an image, such as a faxed document, or a document that contains both text and images. This component can convert image files, PDF files, and Word documents.

Note: The image file types include .png, .jpeg, .bmp, .gif, and .tiff. Graphic file support is provided by ABBYY FineReader. For a comprehensive list of supported types, see Supported Image Formats.

Use the following methods with this component:

GetProcessToXmlConfig (requires 8.0 SP1 2016 or later)
ProcessToPdf
ProcessToText
ProcessToXml (requires 8.0 SP1 2016 or later)

GetProcessToXmlConfig

Use the GetProcessToXmlConfig method to produce information about the XML configuration. You can then pass that configuration information to the configXml parameter of the ProcessToXml method to change the way the output XML looks. This method returns an XML string that contains the configuration options.

Note: The following parameter settings come from ABBYY FineReader. Turn them on or off to alter the XML output. For example, you can have the output contain one XML line for each character found or, by changing attributes, one XML line for each line of text found.

You can include the following parameters:

Parameter	Description
writeCharAttributes	Include the character attributes.
dontWriteBlocksCoordinates	Omit the block coordinates.
writeExtendedCharAttributes	Include the extended character attributes.
writeOriginalImageCoordinates	Include the original image coordinates.
writeNameOfBlock	Include the name of the XML block.
writeCharacterFormatting	Include character formatting information.
writeParagraphStyles	Include paragraph style information.
writePagesByElements	Include the pages by element.
writeAsciiCharAttributes	Include the ASCII character attributes.
writeWordRecognitionVariants	Include word recognition variants.
writeCharRecognitionVariants	Include character recognition variants.
writeLogicalStructure	Include logical structure information.
writeFontStyles	Include font style information.
writeOneCharForTab	Use one character to indicate tabs.
checkResult	Check the results.
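
The following Python sketch illustrates the idea of turning configuration parameters on or off before passing the configuration on. It is only an illustration: the XML layout used here (a single element named XmlExportParams with one true/false attribute per parameter) is an assumption, because the actual structure that GetProcessToXmlConfig returns is not shown in this document. The helper name set_flags is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration layout: one element whose attributes are the
# parameter names from the table above. The real schema may differ.
SAMPLE_CONFIG = (
    '<XmlExportParams writeCharAttributes="true" '
    'dontWriteBlocksCoordinates="false" writeCharacterFormatting="false" />'
)


def set_flags(config_xml, **flags):
    """Return a copy of config_xml with the given boolean flags set."""
    root = ET.fromstring(config_xml)
    for name, enabled in flags.items():
        # Attributes that do not exist yet are added; existing ones are overwritten.
        root.set(name, "true" if enabled else "false")
    return ET.tostring(root, encoding="unicode")


# Turn per-character attributes off and paragraph styles on.
updated = set_flags(SAMPLE_CONFIG, writeCharAttributes=False, writeParagraphStyles=True)
print(updated)
```

Editing the configuration as parsed XML, rather than with string replacement, keeps the result well-formed regardless of attribute order or spacing.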

ProcessToPdf

Use the ProcessToPdf method to extract text from the images in the input file and output a PDF file. You can then use the PdfConnector or PdfViewer components to extract data from this PDF file. You can use this method with three or eight parameters. If you use this method with three parameters, the defaults are used for the other five parameters.

Three parameters (output type: Boolean)

inputFile – (String) Enter the complete path and file name of the file that you want to process. You can process image files, PDF files (.pdf), and Word documents (.doc and .docx).

outputFile – (String) Enter the complete path and file name of the PDF file that the ProcessToPdf method creates.

exportWithoutImages – (Boolean) Use this parameter to include or exclude the images in the output file. Set the parameter to True to display only the text that OCR extracts. Set the parameter to False to export the original image with hidden text, so that you can select the text when you highlight the image in the PDF file.

Eight parameters (output type: Boolean)

inputFile – (String) Enter the complete path and file name of the file that you want to process. You can process image files, PDF files (.pdf), and Word documents (.doc and .docx).

outputFile – (String) Enter the complete path and file name of the PDF file that the ProcessToPdf method creates.

exportWithoutImages – (Boolean) Use this parameter to include or exclude the images in the output file. Set the parameter to True to display only the text that OCR extracts. Set the parameter to False to export the original image with hidden text, so that you can select the text when you highlight the image in the PDF file.

ocrImagesAndText – (Boolean) Determines whether ABBYY runs OCR on the existing text in the input file or passes that text directly to the PDF file without modification. If you enter False, the text appears in the output file exactly as it appears in the input file; however, the text in some documents might not be retrieved. If you enter True, ABBYY evaluates the text before it creates the output file, which can lead to inaccurate results. The default is False.

coloredBackground – (Boolean) If the document has a dark background, set the coloredBackground parameter to True to improve the accuracy of the conversion. For instance, if white text is displayed on a black background, set the coloredBackground parameter to True. The default is False.

lowResolutionText – (Boolean) If the text in the document has a small font size, set the lowResolutionText parameter to True. With this setting, the method takes longer to complete. The default is False.

ocrDictionaryType – Use the ocrDictionaryType setting to reduce the character set that ABBYY uses when converting an image into text. For example, if you know that there are no numbers in the image, you can set the ocrDictionaryType option to AlphaOnly to omit numbers from the character set and reduce the possibility of errors, such as changing the text 'Search' into '5earch'. You can choose from the following options:

Normal - Uses a character set that is normal for the language that you are scanning. The default is Normal.

AlphaOnly - Limits the character set to only a-z and A-Z.

NumOnly - Limits the character set to only 0-9.

AlphaAndSymbolsOnly - Limits the character set to a-z and A-Z and special characters such as hyphens (-), colons (:), and other punctuation marks.

NumAndSymbolsOnly - Limits the character set to 0-9 and special characters such as hyphens (-), colons (:), dollar signs ($), and other punctuation marks.

scanLanguage – Choose from 459 scan languages, including widely used languages such as English, Spanish, and Japanese, and some combinations of languages, such as Korean and English. Some scan language options also limit the character set, similar to the way that ocrDictionaryType limits it. The following are some examples:

• English US WellKnownCode SSN

• English UK CurrencyByDigits

• English DateTime MonthByWords

Note: Farsi, Arabic, Hebrew, Vietnamese, and Thai are not included. The default is English.
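
The relationship between the three-parameter and eight-parameter forms can be sketched in Python as follows. This is a model of the documented defaults only, not a call into the DocumentOcr component. The class name PdfOcrOptions, the function expand_arguments, the file paths, and the order of the arguments are all assumptions made for illustration.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class PdfOcrOptions:
    """Documented defaults for the five optional ProcessToPdf parameters."""
    ocr_images_and_text: bool = False     # ocrImagesAndText
    colored_background: bool = False      # coloredBackground
    low_resolution_text: bool = False     # lowResolutionText
    ocr_dictionary_type: str = "Normal"   # ocrDictionaryType
    scan_language: str = "English"        # scanLanguage


def expand_arguments(input_file, output_file, export_without_images,
                     options=PdfOcrOptions()):
    """Return the eight values that a three-parameter call implies.

    The argument order is an assumption made for this sketch.
    """
    return (input_file, output_file, export_without_images, *astuple(options))


# Hypothetical file paths, for illustration only.
default_call = expand_arguments(r"C:\scans\invoice.tif", r"C:\scans\invoice.pdf", False)
dark_page = expand_arguments(r"C:\scans\slide.png", r"C:\scans\slide.pdf", False,
                             PdfOcrOptions(colored_background=True))
print(default_call)
print(dark_page)
```

In this model, a caller overrides only the settings that differ from the defaults, such as coloredBackground for white text on a black background.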

ProcessToText

Use the ProcessToText method to extract text from the images in the input file and output a string that contains the text. You can then use this string in a Robotic Automation Studio automation. You can use this method with two or seven parameters. If you use this method with two parameters, the defaults are used for the other five parameters.

Two parameters (output type: Boolean)

inputFile – (String) Enter the complete path and file name of the file that you want to process. You can process image files, PDF files (.pdf), and Word documents (.doc and .docx).

extractedText – (String) This output parameter contains the text that the system retrieves during DocumentOcr processing.

Seven parameters (output type: Boolean)

inputFile – (String) Enter the complete path and file name of the file that you want to process. You can process image files, PDF files (.pdf), and Word documents (.doc and .docx).

extractedText – (String) This output parameter contains the text that the system retrieves during DocumentOcr processing.

ocrImagesAndText – (Boolean) Determines whether ABBYY runs OCR on the existing text in the input file or passes that text directly to the output string without modification. If you enter False, the text is returned exactly as it appears in the input file; however, the text in some documents might not be retrieved. If you enter True, ABBYY evaluates the text before it produces the output, which can lead to inaccurate results. The default is False.

coloredBackground – (Boolean) If the document has a dark background, set the coloredBackground parameter to True to improve the accuracy of the conversion. For instance, if white text is displayed on a black background, set the coloredBackground parameter to True. The default is False.

lowResolutionText – (Boolean) If the text in the document has a small font size, set the lowResolutionText parameter to True. With this setting, the method takes longer to complete. The default is False.

ocrDictionaryType – Use the ocrDictionaryType setting to reduce the character set that ABBYY uses when converting an image into text. For example, if you know that there are no numbers in the image, you can set the ocrDictionaryType option to AlphaOnly to omit numbers from the character set and reduce the possibility of errors, such as changing the text 'Search' into '5earch'. You can choose from the following options:

Normal - Uses a character set that is normal for the language that you are scanning. The default is Normal.

AlphaOnly - Limits the character set to only a-z and A-Z.

NumOnly - Limits the character set to only 0-9.

AlphaAndSymbolsOnly - Limits the character set to a-z and A-Z and special characters such as hyphens (-), colons (:), and other punctuation marks.

NumAndSymbolsOnly - Limits the character set to 0-9 and special characters such as hyphens (-), colons (:), dollar signs ($), and other punctuation marks.

scanLanguage – Choose from 459 scan languages, including widely used languages such as English, Spanish, and Japanese, and some combinations of languages, such as Korean and English. Some scan language options also limit the character set, similar to the way that ocrDictionaryType limits it. The following are some examples:

• English US WellKnownCode SSN

• English UK CurrencyByDigits

• English DateTime MonthByWords

Note: Farsi, Arabic, Hebrew, Vietnamese, and Thai are not included. The default is English.
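
The ocrDictionaryType options can be pictured as character-set filters. The Python sketch below models that idea; it is not how ABBYY implements the setting. Two details are assumptions: Python's string.punctuation stands in for the "special characters" that the options mention, and whitespace is always treated as allowed. The function name fits_dictionary is hypothetical.

```python
import string

# Punctuation is a stand-in for "special characters"; the exact set that
# ABBYY allows is not listed in this document.
SYMBOLS = set(string.punctuation)
LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)

ALLOWED = {
    "AlphaOnly": LETTERS,
    "NumOnly": DIGITS,
    "AlphaAndSymbolsOnly": LETTERS | SYMBOLS,
    "NumAndSymbolsOnly": DIGITS | SYMBOLS,
}


def fits_dictionary(text, dictionary_type):
    """Return True if every non-space character is allowed by the dictionary type."""
    if dictionary_type == "Normal":
        return True  # Normal does not narrow the language's character set
    allowed = ALLOWED[dictionary_type]
    return all(ch in allowed for ch in text if not ch.isspace())


print(fits_dictionary("Search", "AlphaOnly"))             # True
print(fits_dictionary("5earch", "AlphaOnly"))             # False
print(fits_dictionary("$1,250.00", "NumAndSymbolsOnly"))  # True
```

With AlphaOnly, a reading such as '5earch' is not a possible result, because the digit 5 is outside the allowed character set.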

ProcessToXml

Use the ProcessToXml method to extract text from an image or document and export that text in XML format. You can use this method with two, three, seven, or eight parameters. If you use the two or three parameter variants of this method, the defaults are used for the other parameters.

Two parameters (output type: String, the XML content)

inputFile – (String) Enter the complete path and file name of the image file or document that you want to process. You can process image files, PDF files (.pdf), and Word documents (.doc and .docx).

configXml – (String) Enter the complete path and file name of the XML configuration file that contains the parameters for formatting the XML output.

Three parameters (output type: Boolean)

inputFile – (String) Enter the complete path and file name of the image file or document that you want to process. You can process image files, PDF files (.pdf), and Word documents (.doc and .docx).

outputFile – (String) Enter the complete path and file name of the XML file that the ProcessToXml method creates.

configXml – (String) Enter the complete path and file name of the XML configuration file that contains the parameters for formatting the XML output.

Seven parameters (output type: String, the XML content)

inputFile – (String) Enter the complete path and file name of the image file or document that you want to process. You can process image files, PDF files (.pdf), and Word documents (.doc and .docx).

ocrImagesAndText – (Boolean) Determines whether ABBYY runs OCR on the existing text in the input file or passes that text directly to the XML output without modification. If you enter False, the text appears in the output exactly as it appears in the input file; however, the text in some documents might not be retrieved. If you enter True, ABBYY evaluates the text before it creates the output, which can lead to inaccurate results. The default is False.

configXml – (String) Enter the complete path and file name of the XML configuration file that contains the parameters for formatting the XML output.

coloredBackground – (Boolean) If the document has a dark background, set the coloredBackground parameter to True to improve the accuracy of the conversion. For instance, if white text is displayed on a black background, set the coloredBackground parameter to True. The default is False.

lowResolutionText – (Boolean) If the text in the document has a small font size, set the lowResolutionText parameter to True. With this setting, the method takes longer to complete. The default is False.

ocrDictionaryType – Use the ocrDictionaryType setting to reduce the character set that ABBYY uses when converting an image into text. For example, if you know that there are no numbers in the image, you can set the ocrDictionaryType option to AlphaOnly to omit numbers from the character set and reduce the possibility of errors, such as changing the text 'Search' into '5earch'. You can choose from the following options:

Normal - Uses a character set that is normal for the language that you are scanning. The default is Normal.

AlphaOnly - Limits the character set to only a-z and A-Z.

NumOnly - Limits the character set to only 0-9.

AlphaAndSymbolsOnly - Limits the character set to a-z and A-Z and special characters such as hyphens (-), colons (:), and other punctuation marks.

NumAndSymbolsOnly - Limits the character set to 0-9 and special characters such as hyphens (-), colons (:), dollar signs ($), and other punctuation marks.

scanLanguage – Choose from 459 scan languages, including widely used languages such as English, Spanish, and Japanese, and some combinations of languages, such as Korean and English. Some scan language options also limit the character set, similar to the way that ocrDictionaryType limits it. The following are some examples:

• English US WellKnownCode SSN

• English UK CurrencyByDigits

• English DateTime MonthByWords

Note: Farsi, Arabic, Hebrew, Vietnamese, and Thai are not included. The default is English.

Eight parameters (output type: Boolean)

inputFile – (String) Enter the complete path and file name of the image file or document that you want to process. You can process image files, PDF files (.pdf), and Word documents (.doc and .docx).

outputFile – (String) Enter the complete path and file name of the XML file that the ProcessToXml method creates.

ocrImagesAndText – (Boolean) Determines whether ABBYY runs OCR on the existing text in the input file or passes that text directly to the XML file without modification. If you enter False, the text appears in the output file exactly as it appears in the input file; however, the text in some documents might not be retrieved. If you enter True, ABBYY evaluates the text before it creates the output file, which can lead to inaccurate results. The default is False.

configXml – (String) Enter the complete path and file name of the XML configuration file that contains the parameters for formatting the XML output.

coloredBackground – (Boolean) If the document has a dark background, set the coloredBackground parameter to True to improve the accuracy of the conversion. For instance, if white text is displayed on a black background, set the coloredBackground parameter to True. The default is False.

lowResolutionText – (Boolean) If the text in the document has a small font size, set the lowResolutionText parameter to True. With this setting, the method takes longer to complete. The default is False.

ocrDictionaryType – Use the ocrDictionaryType setting to reduce the character set that ABBYY uses when converting an image into text. For example, if you know that there are no numbers in the image, you can set the ocrDictionaryType option to AlphaOnly to omit numbers from the character set and reduce the possibility of errors, such as changing the text 'Search' into '5earch'. You can choose from the following options:

Normal - Uses a character set that is normal for the language that you are scanning. The default is Normal.

AlphaOnly - Limits the character set to only a-z and A-Z.

NumOnly - Limits the character set to only 0-9.

AlphaAndSymbolsOnly - Limits the character set to a-z and A-Z and special characters such as hyphens (-), colons (:), and other punctuation marks.

NumAndSymbolsOnly - Limits the character set to 0-9 and special characters such as hyphens (-), colons (:), dollar signs ($), and other punctuation marks.

scanLanguage – Choose from 459 scan languages, including widely used languages such as English, Spanish, and Japanese, and some combinations of languages, such as Korean and English. Some scan language options also limit the character set, similar to the way that ocrDictionaryType limits it. The following are some examples:

• English US WellKnownCode SSN

• English UK CurrencyByDigits

• English DateTime MonthByWords

Note: Farsi, Arabic, Hebrew, Vietnamese, and Thai are not included. The default is English.
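
After ProcessToXml returns the XML content, an automation typically needs to read text back out of it. The Python sketch below shows one way to rebuild lines of text from per-character output. The sample fragment is hypothetical: the element names (document, page, line, charParams) and the coordinate attributes are assumptions, and the real output depends on ABBYY FineReader and on the configXml settings. The function name lines_of_text is also hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical output fragment with one element per recognized character.
SAMPLE_OUTPUT = """
<document>
  <page>
    <line l="10" t="12" r="90" b="24">
      <charParams l="10" t="12" r="18" b="24">H</charParams>
      <charParams l="20" t="12" r="28" b="24">i</charParams>
    </line>
    <line l="10" t="30" r="90" b="42">
      <charParams l="10" t="30" r="18" b="42">O</charParams>
      <charParams l="20" t="30" r="28" b="42">K</charParams>
    </line>
  </page>
</document>
"""


def lines_of_text(xml_content):
    """Join the characters of each line element into one string per line."""
    root = ET.fromstring(xml_content.strip())
    return [
        "".join(ch.text or "" for ch in line.iter("charParams"))
        for line in root.iter("line")
    ]


print(lines_of_text(SAMPLE_OUTPUT))  # ['Hi', 'OK']
```

If the configuration produces one XML line per line of text instead of one per character, the same approach applies with the line element's own text in place of the joined characters.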